Preface


This  book  is  aimed  at  the  reader  who  wishes  to  gain  a  working  knowledge  of  time  series 
and  forecasting  methods  as  applied  in  economics,  engineering,  and  the  natural  and 
social  sciences.  Unlike  our  more  advanced  book,  Time  Series:  Theory  and  Methods , 
Brockwell  and  Davis  (1991),  this  one  requires  only  a  knowledge  of  basic  calculus, 
matrix  algebra  and  elementary  statistics  at  the  level,  for  example,  of  Mendenhall  et  al. 
(1990).  It  is  intended  for  upper-level  undergraduate  students  and  beginning  graduate 
students. 

The  emphasis  is  on  methods  and  the  analysis  of  data  sets.  The  professional  version 
of  the  time  series  package  ITSM2000,  for  Windows-based  PC,  enables  the  reader  to 
reproduce  most  of  the  calculations  in  the  text  (and  to  analyze  further  data  sets  of  the 
reader’s  own  choosing).  It  is  available  for  download,  together  with  most  of  the  data 
sets  used  in  the  book,  from  http://extras.springer.com.  Appendix  E  contains  a  detailed 
introduction  to  the  package. 

Very  little  prior  familiarity  with  computing  is  required  in  order  to  use  the  computer 
package.  The  book  can  also  be  used  in  conjunction  with  other  computer  packages  for 
handling  time  series.  Chapter  14  of  the  book  by  Venables  and  Ripley  (2003)  describes 
how  to  perform  many  of  the  calculations  using  S  and  R.  The  package  ITSMR  of  Weigt 
(2015)  can  be  used  in  R  to  reproduce  many  of  the  features  of  ITSM2000.  The  package 
Yuima, also for R, can be used for simulation and estimation of the Lévy-driven
CARMA processes discussed in Section 11.5 (see Iacus and Mercuri (2015)). Both
of these packages can be downloaded from https://cran.r-project.org/web/packages.

There  are  numerous  problems  at  the  end  of  each  chapter,  many  of  which  involve 
use  of  the  programs  to  study  the  data  sets  provided. 

To  make  the  underlying  theory  accessible  to  a  wider  audience,  we  have  stated  some 
of  the  key  mathematical  results  without  proof,  but  have  attempted  to  ensure  that  the 
logical  structure  of  the  development  is  otherwise  complete.  (References  to  proofs  are 
provided  for  the  interested  reader.) 

There  is  sufficient  material  here  for  a  full-year  introduction  to  univariate  and 
multivariate time series and forecasting. Chapters 1 through 6 have been used for several
years in introductory one-semester courses in univariate time series at Columbia
University,  Colorado  State  University,  and  Royal  Melbourne  Institute  of  Technology. 
The  chapter  on  spectral  analysis  can  be  excluded  without  loss  of  continuity  by  readers 
who  are  so  inclined. 

In  view  of  the  explosion  of  interest  in  financial  time  series  in  recent  decades,  the 
third  edition  includes  a  new  chapter  (Chapter  7)  specifically  devoted  to  this  topic.  Some 
of  the  basic  tools  required  for  an  understanding  of  continuous-time  financial  time  series 
models (Brownian motion, Lévy processes, and Itô calculus) have also been added as
Appendix D, and a new Section 11.5 provides an introduction to continuous parameter
ARMA (or CARMA) processes.

The  diskette  containing  the  student  version  of  the  package  ITSM2000  is  no  longer 
included  with  the  book  since  the  professional  version  (which  places  no  limit  on  the 
length of the series to be studied) can now be downloaded from http://extras.springer.com
as indicated above. A tutorial for the use of the package is provided as Appendix E
and  a  searchable  file,  ITSM_HELP.pdf,  giving  more  detailed  instructions,  is  included 
with  the  package. 

We  are  greatly  indebted  to  the  readers  of  the  first  and  second  editions  of  the  book 
and  especially  to  Matthew  Calder,  coauthor  of  the  computer  package  ITSM2000  and 
to  Anthony  Brockwell,  both  of  whom  made  many  valuable  comments  and  suggestions. 
We  also  wish  to  thank  Colorado  State  University,  Columbia  University,  the  National 
Science Foundation, Springer-Verlag, and our families for their continuing support
during  the  preparation  of  this  third  edition. 

Fort  Collins,  CO,  USA  Peter  J.  Brockwell 

New  York,  NY,  USA  Richard  A.  Davis 

April,  2016 
In this chapter we introduce some basic ideas of time series analysis and stochastic
processes. Of particular importance are the concepts of stationarity and the autocovariance
and sample autocovariance functions. Some standard techniques are described
for the estimation and removal of trend and seasonality (of known period) from
an observed time series. These are illustrated with reference to the data sets in
Section 1.1. The calculations in all the examples can be carried out using the time
series package ITSM, the professional version of which is available at
http://extras.springer.com. The data sets are contained in files with names ending in .TSM. For
example, the Australian red wine sales are filed as WINE.TSM. Most of the topics
covered  in  this  chapter  will  be  developed  more  fully  in  later  sections  of  the  book.  The 
reader  who  is  not  already  familiar  with  random  variables  and  random  vectors  should 
first  read  Appendix  A,  where  a  concise  account  of  the  required  background  is  given. 


1.1 Examples of Time Series

A time series is a set of observations $x_t$, each one being recorded at a specific time $t$.
A discrete-time time series (the type to which this book is primarily devoted) is one
in which the set $T_0$ of times at which observations are made is a discrete set, as is the
case, for example, when observations are made at fixed time intervals. Continuous-time
time series are obtained when observations are recorded continuously over some
time interval, e.g., when $T_0 = [0, 1]$.
Chapter 1 Introduction

Figure 1-1 The Australian red wine sales, Jan. 1980-Oct. 1991

Example 1.1.1 Australian Red Wine Sales; WINE.TSM

Figure 1-1 shows the monthly sales (in kiloliters) of red wine by Australian winemakers
from January 1980 through October 1991. In this case the set $T_0$ consists of the
142 times {(Jan. 1980), (Feb. 1980), ..., (Oct. 1991)}. Given a set of $n$ observations
made at uniformly spaced time intervals, it is often convenient to rescale the time axis
in such a way that $T_0$ becomes the set of integers $\{1, 2, \ldots, n\}$. In the present example
this amounts to measuring time in months with (Jan. 1980) as month 1. Then $T_0$ is the
set $\{1, 2, \ldots, 142\}$. It appears from the graph that the sales have an upward trend and
a seasonal pattern with a peak in July and a trough in January. To plot the data using
ITSM, run the program by double-clicking on the ITSM icon and then select the option
File>Project>Open>Univariate, click OK, and select the file WINE.TSM.
The graph of the data will then appear on your screen.

□


Example 1.1.2 All-Star Baseball Games, 1933-1995

Figure 1-2 shows the results of the all-star games by plotting $x_t$, where

$$x_t = \begin{cases} 1 & \text{if the National League won in year } t, \\ -1 & \text{if the American League won in year } t. \end{cases}$$

This is a series with only two possible values, $\pm 1$. It also has some missing values,
since no game was played in 1945, and two games were scheduled for each of the
years 1959-1962.

□


Example 1.1.3 Accidental Deaths, U.S.A., 1973-1978; DEATHS.TSM

Like the red wine sales, the monthly accidental death figures show a strong seasonal
pattern, with the maximum for each year occurring in July and the minimum for each
year occurring in February. The presence of a trend in Figure 1-3 is much less apparent
than in the wine sales. In Section 1.5 we shall consider the problem of representing
the data as the sum of a trend, a seasonal component, and a residual term.

□ 


Figure 1-2 Results of the all-star baseball games, 1933-1995

Figure 1-3 The monthly accidental deaths data, 1973-1978

Example 1.1.4 A Signal Detection Problem; SIGNAL.TSM

Figure 1-4 shows simulated values of the series

$$X_t = \cos\left(\frac{t}{10}\right) + N_t, \quad t = 1, 2, \ldots, 200,$$

where $\{N_t\}$ is a sequence of independent normal random variables, with mean 0
and variance 0.25. Such a series is often referred to as signal plus noise, the signal
being the smooth function, $S_t = \cos\left(\frac{t}{10}\right)$ in this case. Given only the data $X_t$, how
can we determine the unknown signal component? There are many approaches to
this general problem under varying assumptions about the signal and the noise. One
simple approach is to smooth the data by expressing $X_t$ as a sum of sine waves of
various frequencies (see Section 4.2) and eliminating the high-frequency components.
If we do this to the values of $\{X_t\}$ shown in Figure 1-4 and retain only the lowest 3.5%
of the frequency components, we obtain the estimate of the signal also shown as the
red dashed line in Figure 1-4. The waveform of the signal is quite close to that of the
true signal in this case, although its amplitude is somewhat smaller.

□ 
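The smoothing idea in Example 1.1.4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the signal is taken to be cos(t/10), the noise is drawn afresh with an arbitrary seed (so it is not the SIGNAL.TSM data), and the helper `lowpass_smooth` is a hypothetical name that simply zeroes all but the lowest fraction of the discrete Fourier coefficients. It is not claimed to reproduce the exact smoothing procedure used by ITSM.

```python
import numpy as np

def lowpass_smooth(x, keep_fraction):
    """Zero all but the lowest `keep_fraction` of the Fourier coefficients of x."""
    x = np.asarray(x, dtype=float)
    coeffs = np.fft.rfft(x)                        # coefficients for frequencies 0..n/2
    n_keep = max(1, int(np.ceil(keep_fraction * coeffs.size)))
    coeffs[n_keep:] = 0.0                          # discard the high-frequency components
    return np.fft.irfft(coeffs, n=x.size)          # back to the time domain

rng = np.random.default_rng(0)                     # arbitrary seed (assumption)
t = np.arange(1, 201)
signal = np.cos(t / 10)                            # assumed signal S_t
noise = rng.normal(loc=0.0, scale=0.5, size=t.size)  # variance 0.25
x = signal + noise                                 # signal plus noise

estimate = lowpass_smooth(x, keep_fraction=0.035)

# Mean squared distance from the true signal, before and after smoothing.
mse_raw = np.mean((x - signal) ** 2)
mse_smooth = np.mean((estimate - signal) ** 2)
print(mse_raw, mse_smooth)
```

Comparing `mse_smooth` with `mse_raw` gives a rough numerical counterpart to the visual comparison in Figure 1-4.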
Figure 1-4 The series $\{X_t\}$ of Example 1.1.4

Figure 1-5 Population of the U.S.A. at 10-year intervals, 1790-1990

Example 1.1.5 Population of the U.S.A., 1790-1990; USPOP.TSM

The population of the U.S.A., measured at 10-year intervals, is shown in Figure 1-5.
The graph suggests the possibility of fitting a quadratic or exponential trend to the
data. We shall explore this further in Section 1.3.

□


Example 1.1.6 Number of Strikes Per Year in the U.S.A., 1951-1980; STRIKES.TSM


The  annual  numbers  of  strikes  in  the  U.S.A.  for  the  years  1951-1980  are  shown  in 
Figure  1-6.  They  appear  to  fluctuate  erratically  about  a  slowly  changing  level. 


□ 


1.2 Objectives of Time Series Analysis

The  examples  considered  in  Section  1.1  are  an  extremely  small  sample  from  the 
multitude  of  time  series  encountered  in  the  fields  of  engineering,  science,  sociology, 
and  economics.  Our  purpose  in  this  book  is  to  study  techniques  for  drawing  inferences 
from  such  series.  Before  we  can  do  this,  however,  it  is  necessary  to  set  up  a  hypothetical 
probability  model  to  represent  the  data.  After  an  appropriate  family  of  models  has 
been  chosen,  it  is  then  possible  to  estimate  parameters,  check  for  goodness  of  fit  to 
the  data,  and  possibly  to  use  the  fitted  model  to  enhance  our  understanding  of  the 
mechanism  generating  the  series.  Once  a  satisfactory  model  has  been  developed,  it 
may  be  used  in  a  variety  of  ways  depending  on  the  particular  field  of  application. 

The  model  may  be  used  simply  to  provide  a  compact  description  of  the  data.  We 
may,  for  example,  be  able  to  represent  the  accidental  deaths  data  of  Example  1.1.3  as 
the  sum  of  a  specified  trend,  and  seasonal  and  random  terms.  For  the  interpretation 
of  economic  statistics  such  as  unemployment  figures,  it  is  important  to  recognize 
the  presence  of  seasonal  components  and  to  remove  them  so  as  not  to  confuse 
them  with  long-term  trends.  This  process  is  known  as  seasonal  adjustment.  Other 
applications  of  time  series  models  include  separation  (or  filtering)  of  noise  from  signals 
as  in  Example  1.1.4,  prediction  of  future  values  of  a  series  such  as  the  red  wine 
sales  in  Example  1.1.1  or  the  population  data  in  Example  1.1.5,  testing  hypotheses 
such  as  global  warming  using  recorded  temperature  data,  predicting  one  series  from 
observations  of  another,  e.g.,  predicting  future  sales  using  advertising  expenditure  data, 
and  controlling  future  values  of  a  series  by  adjusting  parameters.  Time  series  models 
are  also  useful  in  simulation  studies.  For  example,  the  performance  of  a  reservoir 
depends  heavily  on  the  random  daily  inputs  of  water  to  the  system.  If  these  are  modeled 
as  a  time  series,  then  we  can  use  the  fitted  model  to  simulate  a  large  number  of 
independent  sequences  of  daily  inputs.  Knowing  the  size  and  mode  of  operation 
of  the  reservoir,  we  can  determine  the  fraction  of  the  simulated  input  sequences  that 
cause  the  reservoir  to  run  out  of  water  in  a  given  time  period.  This  fraction  will  then  be 
an  estimate  of  the  probability  of  emptiness  of  the  reservoir  at  some  time  in  the  given 
period. 
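The reservoir simulation described above can be sketched as a small Monte Carlo experiment. Everything specific here is an assumption made for illustration: the function name `estimate_emptiness_probability`, the inflow model (a mean level plus a first-order autoregressive deviation, truncated at zero), the constant daily demand, and all numerical values. In practice the inflows would be simulated from whatever time series model had actually been fitted to the data.

```python
import numpy as np

def estimate_emptiness_probability(n_sims, n_days, capacity, initial_level,
                                   daily_demand, mean_inflow, phi, noise_sd,
                                   seed=0):
    """Fraction of simulated inflow sequences for which the reservoir empties."""
    rng = np.random.default_rng(seed)
    level = np.full(n_sims, float(initial_level))   # one reservoir per simulated sequence
    deviation = np.zeros(n_sims)                    # inflow deviation from its mean
    emptied = np.zeros(n_sims, dtype=bool)
    for _ in range(n_days):
        # Hypothetical inflow model: dependent deviations around a mean inflow.
        deviation = phi * deviation + rng.normal(0.0, noise_sd, size=n_sims)
        inflow = np.maximum(mean_inflow + deviation, 0.0)
        # Water above capacity is spilled, then the daily demand is withdrawn.
        level = np.minimum(level + inflow, capacity) - daily_demand
        emptied |= level <= 0.0                     # record sequences that ran dry
        level = np.maximum(level, 0.0)              # an empty reservoir cannot go negative
    return emptied.mean()

p_empty = estimate_emptiness_probability(
    n_sims=1000, n_days=365, capacity=100.0, initial_level=50.0,
    daily_demand=1.0, mean_inflow=1.05, phi=0.7, noise_sd=0.5)
print(p_empty)
```

The returned fraction plays the role of the estimated probability of emptiness; its accuracy depends on the number of simulated sequences and on how well the fitted model represents the real inflows.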
1.3 Some Simple Time Series Models

An important part of the analysis of a time series is the selection of a suitable probability
model (or class of models) for the data. To allow for the possibly unpredictable
nature of future observations it is natural to suppose that each observation $x_t$ is a
realized value of a certain random variable $X_t$.


Definition  1.3.1 


A time series model for the observed data $\{x_t\}$ is a specification of the joint
distributions (or possibly only the means and covariances) of a sequence of random
variables $\{X_t\}$ of which $\{x_t\}$ is postulated to be a realization.


Remark. We shall frequently use the term time series to mean both the data and the
process of which it is a realization. □

A complete probabilistic time series model for the sequence of random variables
$\{X_1, X_2, \ldots\}$ would specify all of the joint distributions of the random vectors
$(X_1, \ldots, X_n)'$, $n = 1, 2, \ldots$, or equivalently all of the probabilities

$$P[X_1 \le x_1, \ldots, X_n \le x_n], \quad -\infty < x_1, \ldots, x_n < \infty, \quad n = 1, 2, \ldots.$$

Such  a  specification  is  rarely  used  in  time  series  analysis  (unless  the  data  are  generated 
by  some  well-understood  simple  mechanism),  since  in  general  it  will  contain  far  too 
many  parameters  to  be  estimated  from  the  available  data.  Instead  we  specify  only  the 
first-  and  second-order  moments  of  the  joint  distributions,  i.e.,  the  expected  values 
$EX_t$ and the expected products $E(X_{t+h} X_t)$, $t = 1, 2, \ldots$, $h = 0, 1, 2, \ldots$, focusing
on properties of the sequence $\{X_t\}$ that depend only on these. Such properties of $\{X_t\}$
are referred to as second-order properties. In the particular case where all the joint
distributions are multivariate normal, the second-order properties of $\{X_t\}$ completely
determine the joint distributions and hence give a complete probabilistic characterization
of the sequence. In general we shall lose a certain amount of information by
looking  at  time  series  “through  second-order  spectacles”;  however,  as  we  shall  see 
in  Chapter  2,  the  theory  of  minimum  mean  squared  error  linear  prediction  depends 
only  on  the  second-order  properties,  thus  providing  further  justification  for  the  use 
of  the  second-order  characterization  of  time  series  models. 

Figure 1-7 shows one of many possible realizations of $\{S_t, t = 1, \ldots, 200\}$, where
$\{S_t\}$ is a sequence of random variables specified in Example 1.3.3 below. In most
practical  problems  involving  time  series  we  see  only  one  realization.  For  example, 
there  is  only  one  available  realization  of  Fort  Collins’s  annual  rainfall  for  the  years 
1900-1996,  but  we  imagine  it  to  be  one  of  the  many  sequences  that  might  have 
occurred.  In  the  following  examples  we  introduce  some  simple  time  series  models. 
One  of  our  goals  will  be  to  expand  this  repertoire  so  as  to  have  at  our  disposal  a  broad 
range  of  models  with  which  to  try  to  match  the  observed  behavior  of  given  data  sets. 


1.3.1 Some Zero-Mean Models

Example 1.3.1 iid Noise

Perhaps the simplest model for a time series is one in which there is no trend or seasonal
component and in which the observations are simply independent and identically
distributed (iid) random variables with zero mean. We refer to such a sequence
of random variables $X_1, X_2, \ldots$ as iid noise. By definition we can write, for any
positive integer $n$ and real numbers $x_1, \ldots, x_n$,

$$P[X_1 \le x_1, \ldots, X_n \le x_n] = P[X_1 \le x_1] \cdots P[X_n \le x_n] = F(x_1) \cdots F(x_n),$$

where $F(\cdot)$ is the cumulative distribution function (see Section A.1) of each of
the identically distributed random variables $X_1, X_2, \ldots$. In this model there is no
dependence between observations. In particular, for all $h \ge 1$ and all $x, x_1, \ldots, x_n$,

$$P[X_{n+h} \le x \mid X_1 = x_1, \ldots, X_n = x_n] = P[X_{n+h} \le x],$$

showing that knowledge of $X_1, \ldots, X_n$ is of no value for predicting the behavior of
$X_{n+h}$. Given the values of $X_1, \ldots, X_n$, the function $f$ that minimizes the mean squared
error $E[(X_{n+h} - f(X_1, \ldots, X_n))^2]$ is in fact identically zero (see Problem 1.2). Although
this means that iid noise is a rather uninteresting process for forecasters, it plays an
important role as a building block for more complicated time series models.

□


Example 1.3.2 A Binary Process

As an example of iid noise, consider the sequence of iid random variables $\{X_t, t = 1, 2, \ldots\}$ with

$$P[X_t = 1] = p, \quad P[X_t = -1] = 1 - p,$$

where $p = \frac{1}{2}$. The time series obtained by tossing a penny repeatedly and scoring $+1$
for each head and $-1$ for each tail is usually modeled as a realization of this process.
A priori we might well consider the same process as a model for the all-star baseball
games in Example 1.1.2. However, even a cursory inspection of the results from 1963-1982,
which show the National League winning 19 of 20 games, casts serious doubt
on the hypothesis $P[X_t = 1] = \frac{1}{2}$.

□


Example 1.3.3 Random Walk

The random walk $\{S_t, t = 0, 1, 2, \ldots\}$ (starting at zero) is obtained by cumulatively
summing (or “integrating”) iid random variables. Thus a random walk with zero mean
is obtained by defining $S_0 = 0$ and

$$S_t = X_1 + X_2 + \cdots + X_t, \quad \text{for } t = 1, 2, \ldots,$$

where $\{X_t\}$ is iid noise. If $\{X_t\}$ is the binary process of Example 1.3.2, then $\{S_t, t = 0, 1, 2, \ldots\}$
is called a simple symmetric random walk. This walk can be viewed
as the location of a pedestrian who starts at position zero at time zero and at each
integer time tosses a fair coin, stepping one unit to the right each time a head appears
and one unit to the left for each tail. A realization of length 200 of a simple symmetric
random walk is shown in Figure 1-7. Notice that the outcomes of the coin tosses can
be recovered from $\{S_t, t = 0, 1, \ldots\}$ by differencing. Thus the result of the $t$th toss can
be found from $S_t - S_{t-1} = X_t$.

□ 
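The construction of the simple symmetric random walk, and the recovery of the coin tosses by differencing, can be sketched as follows. The seed and variable names are arbitrary choices for illustration, and the realization produced is not the one plotted in Figure 1-7.

```python
import numpy as np

rng = np.random.default_rng(1)               # arbitrary seed (assumption)
n = 200

# Binary process of Example 1.3.2 with p = 1/2: +1 for a head, -1 for a tail.
x = rng.choice([-1, 1], size=n)

# Random walk: S_0 = 0 and S_t = X_1 + ... + X_t for t = 1, ..., n.
s = np.concatenate(([0], np.cumsum(x)))

# Differencing recovers the tosses: S_t - S_{t-1} = X_t.
recovered = np.diff(s)
print(np.array_equal(recovered, x))
```

The array `s` has length 201 because it includes the starting value $S_0 = 0$.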


Figure 1-7 One realization of a simple random walk $\{S_t, t = 0, 1, 2, \ldots, 200\}$

1.3.2 Models with Trend and Seasonality

In several of the time series examples of Section 1.1 there is a clear trend in the data.
An increasing trend is apparent in both the Australian red wine sales (Figure 1-1) and
the population of the U.S.A. (Figure 1-5). In both cases a zero-mean model for the data
is clearly inappropriate. The graph of the population data, which contains no apparent
periodic component, suggests trying a model of the form

$$X_t = m_t + Y_t,$$

where $m_t$ is a slowly changing function known as the trend component and $Y_t$ has
zero mean. A useful technique for estimating $m_t$ is the method of least squares (some
other methods are considered in Section 1.5).

In the least squares procedure we attempt to fit a parametric family of functions, e.g.,

$$m_t = a_0 + a_1 t + a_2 t^2, \qquad (1.3.1)$$

to the data $\{x_1, \ldots, x_n\}$ by choosing the parameters, in this illustration $a_0$, $a_1$, and $a_2$, to
minimize $\sum_{t=1}^{n} (x_t - m_t)^2$. This method of curve fitting is called least squares regression
and can be carried out using the program ITSM and selecting the Regression
option.

Example 1.3.4 Population of the U.S.A., 1790-1990

To fit a function of the form (1.3.1) to the population data shown in Figure 1-5 we
relabel the time axis so that $t = 1$ corresponds to 1790 and $t = 21$ corresponds
to 1990. Run ITSM, select File>Project>Open>Univariate, and open the
file USPOP.TSM. Then select Regression>Specify, choose Polynomial
Regression with order equal to 2, and click OK. Finally, selecting the option
Regression>Estimation>Least Squares gives the following estimated
parameter values in the model (1.3.1):

$$\hat a_0 = 6.9579 \times 10^6, \quad \hat a_1 = -2.1599 \times 10^6, \quad \text{and} \quad \hat a_2 = 6.5063 \times 10^5.$$
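Outside ITSM, the same kind of quadratic least squares fit can be sketched with NumPy. The data below are synthetic and purely illustrative: they are generated from a quadratic with the coefficients quoted above plus artificial noise, and are not the USPOP.TSM observations. The helper name `fit_quadratic_trend`, the noise level, and the seed are assumptions.

```python
import numpy as np

def fit_quadratic_trend(x):
    """Least squares fit of m_t = a0 + a1*t + a2*t^2 with t = 1, ..., n."""
    x = np.asarray(x, dtype=float)
    t = np.arange(1, x.size + 1)
    a2, a1, a0 = np.polyfit(t, x, deg=2)      # polyfit returns the highest power first
    trend = a0 + a1 * t + a2 * t**2
    return (a0, a1, a2), trend

# Synthetic illustration only (not the real population series).
rng = np.random.default_rng(2)                # arbitrary seed (assumption)
t = np.arange(1, 22)                          # t = 1, ..., 21
true_trend = 6.9579e6 - 2.1599e6 * t + 6.5063e5 * t**2
x = true_trend + rng.normal(0.0, 1.0e6, size=t.size)

(a0, a1, a2), trend = fit_quadratic_trend(x)
residuals = x - trend                         # estimates of the noise Y_t
print(a0, a1, a2)
```

Because the fitted family includes an intercept, the residuals sum to zero up to rounding error, which is a quick sanity check on any least squares fit of this kind.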


A graph of the fitted function is shown with the original data in Figure 1-8. The
estimated values of the noise process $Y_t$, $1 \le t \le 21$, are the residuals obtained by
subtraction of $\hat m_t = \hat a_0 + \hat a_1 t + \hat a_2 t^2$ from $x_t$.

Figure 1-8 Population of the U.S.A. showing the quadratic trend fitted by least squares

The estimated trend component $\hat m_t$ furnishes us with a natural predictor of future
values of $X_t$. For example, if we estimate the noise $Y_{22}$ by its mean value, i.e., zero,
then (1.3.1) gives the estimated U.S. population for the year 2000 as

$$\hat m_{22} = 6.9579 \times 10^6 - 2.1599 \times 10^6 \times 22 + 6.5063 \times 10^5 \times 22^2 = 274.35 \times 10^6.$$

However, if the residuals $\{Y_t\}$ are highly correlated, we may be able to use their values
to give a better estimate of $Y_{22}$ and hence of the population $X_{22}$ in the year 2000.

□

Example 1.3.5 Level of Lake Huron 1875-1972; LAKE.DAT

A  graph  of  the  level  in  feet  of  Lake  Huron  (reduced  by  570)  in  the  years  1875-1972 
is  displayed  in  Figure  1-9.  Since  the  lake  level  appears  to  decline  at  a  roughly  linear 
rate,  ITSM  was  used  to  fit  a  model  of  the  form 

$$X_t = a_0 + a_1 t + Y_t, \quad t = 1, \ldots, 98 \qquad (1.3.2)$$

(with the time axis relabeled as in Example 1.3.4). The least squares estimates of the
parameter values are

$$\hat a_0 = 10.202 \quad \text{and} \quad \hat a_1 = -0.0242.$$

(The resulting least squares line, $\hat a_0 + \hat a_1 t$, is also displayed in Figure 1-9.) The estimates
of the noise, $Y_t$, in the model (1.3.2) are the residuals obtained by subtracting the
least squares line from $x_t$ and are plotted in Figure 1-10. There are two interesting
features  of  the  graph  of  the  residuals.  The  first  is  the  absence  of  any  discernible  trend. 
The  second  is  the  smoothness  of  the  graph.  (In  particular,  there  are  long  stretches  of 
residuals  that  have  the  same  sign.  This  would  be  very  unlikely  to  occur  if  the  residuals 
were  observations  of  iid  noise  with  zero  mean.)  Smoothness  of  the  graph  of  a  time 
series  is  generally  indicative  of  the  existence  of  some  form  of  dependence  among  the 
observations. 

Such dependence can be used to advantage in forecasting future values of the
series. If we were to assume the validity of the fitted model with iid residuals $\{Y_t\}$, then
the minimum mean squared error predictor of the next residual ($Y_{99}$) would be zero
(by Problem 1.2). However, Figure 1-10 strongly suggests that $Y_{99}$ will be positive.

How then do we quantify dependence, and how do we construct models
for forecasting that incorporate dependence of a particular type? To deal with
these questions, Section 1.4 introduces the autocorrelation function as a measure
of dependence, and stationary processes as a family of useful models exhibiting a
wide variety of dependence structures.

Figure 1-9 Level of Lake Huron 1875-1972 showing the line fitted by least squares

Figure 1-10 Residuals from fitting a line to the Lake Huron data in Figure 1-9

□ 
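The remark that long runs of same-signed residuals are unlikely under iid noise can be illustrated numerically. The sketch below uses synthetic data only, not the Lake Huron series: a line with the coefficients quoted in Example 1.3.5 is combined once with iid noise and once with a hypothetical dependent noise sequence (each value equal to 0.8 times the previous value plus a fresh iid term), and the number of sign changes in the least squares residuals is compared. The function names, the 0.8 weight, and the seed are assumptions.

```python
import numpy as np

def detrend_linear(x):
    """Residuals from a least squares line a0 + a1*t, t = 1, ..., n."""
    t = np.arange(1, len(x) + 1)
    a1, a0 = np.polyfit(t, x, deg=1)          # slope first, then intercept
    return x - (a0 + a1 * t)

def count_sign_changes(residuals):
    """Number of times consecutive residuals have opposite signs."""
    signs = np.sign(residuals)
    return int(np.sum(signs[1:] * signs[:-1] < 0))

rng = np.random.default_rng(3)                # arbitrary seed (assumption)
n = 98
t = np.arange(1, n + 1)
line = 10.202 - 0.0242 * t                    # line with the quoted coefficients

iid_noise = rng.normal(0.0, 1.0, size=n)

# Hypothetical dependent noise: each value carries over part of the previous one.
z = rng.normal(0.0, 1.0, size=n)
dependent_noise = np.empty(n)
dependent_noise[0] = z[0]
for i in range(1, n):
    dependent_noise[i] = 0.8 * dependent_noise[i - 1] + z[i]

changes_iid = count_sign_changes(detrend_linear(line + iid_noise))
changes_dep = count_sign_changes(detrend_linear(line + dependent_noise))
print(changes_iid, changes_dep)
```

For iid residuals roughly half of the consecutive pairs are expected to change sign, so a much smaller count is an informal indication of dependence of the kind visible in Figure 1-10.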

Harmonic Regression

Many time series are influenced by seasonally varying factors such as the weather,
the effect of which can be modeled by a periodic component with fixed known period.
For example, the accidental deaths series (Figure 1-3) shows a repeating annual pattern
with peaks in July and troughs in February, strongly suggesting a seasonal factor with
period 12. In order to represent such a seasonal effect, allowing for noise but assuming
no trend, we can use the simple model,

$$X_t = s_t + Y_t,$$

where $s_t$ is a periodic function of $t$ with period $d$ ($s_{t-d} = s_t$). A convenient choice for
$s_t$ is a sum of harmonics (or sine waves) given by

$$s_t = a_0 + \sum_{j=1}^{k} \bigl(a_j \cos(\lambda_j t) + b_j \sin(\lambda_j t)\bigr), \qquad (1.3.3)$$

where $a_0, a_1, \ldots, a_k$ and $b_1, \ldots, b_k$ are unknown parameters and $\lambda_1, \ldots, \lambda_k$ are fixed
frequencies, each being some integer multiple of $2\pi/d$. To carry out harmonic regression
using ITSM, select Regression>Specify, and check the two boxes,
Include intercept term and Harmonic Regression. Then specify the
number of harmonics [$k$ in equation (1.3.3)] and enter $k$ integer-valued Fourier indices
$f_1, \ldots, f_k$. For a sine wave with period $d$, set $f_1 = n/d$, where $n$ is the number of
observations in the time series. (If $n/d$ is not an integer, you will need to delete a few
observations from the beginning of the series to make it so.) The other $k - 1$ Fourier
indices should be positive integer multiples of the first, corresponding to harmonics
of the fundamental sine wave with period $d$. Thus to fit a single sine wave with
period 365 to 365 daily observations we would choose $k = 1$ and $f_1 = 1$. To fit a linear
combination of sine waves with periods $365/j$, $j = 1, \ldots, 4$, we would choose $k = 4$ and
$f_j = j$, $j = 1, \ldots, 4$. Once $k$ and the frequencies $f_1, \ldots, f_k$ have been specified, click
OK and then select Regression>Estimation>Least Squares to obtain the
required coefficients. To see how well the fitted function matches the data, select
Regression>Show fit.

Figure 1-11 The estimated harmonic component of the accidental deaths data from ITSM

Example 1.3.6 Accidental Deaths

To fit a sum of two harmonics with periods 12 months and 6 months to the monthly
accidental deaths data $x_1, \ldots, x_n$ with $n = 72$, we choose $k = 2$, $f_1 = n/12 = 6$, and
$f_2 = n/6 = 12$. Using ITSM as described above, we obtain the fitted function shown
in Figure 1-11. As can be seen from the figure, the periodic character of the series is
captured reasonably well by this fitted function. In practice, it is worth experimenting
with several different combinations of harmonics in order to find a satisfactory estimate
of the seasonal component. The program ITSM also allows fitting a linear combination
of harmonics and polynomial trend by checking both Harmonic Regression
and Polynomial Regression in the Regression>Specification dialog
box. Other methods for dealing with seasonal variation in the presence of trend are
described in Section 1.5.

□ 
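A harmonic regression of the form (1.3.3) can be sketched as an ordinary least squares problem. The sketch assumes that a Fourier index $f$ corresponds to the frequency $2\pi f/n$, which agrees with setting $f_1 = n/d$ for a sine wave of period $d$. The data are synthetic (not DEATHS.TSM), and the helper names `harmonic_design` and `fit_harmonics`, the coefficient values, and the seed are illustrative assumptions.

```python
import numpy as np

def harmonic_design(n, fourier_indices):
    """Design matrix with an intercept and a cos/sin pair per Fourier index."""
    t = np.arange(1, n + 1)
    columns = [np.ones(n)]                    # intercept term a0
    for f in fourier_indices:
        lam = 2.0 * np.pi * f / n             # frequency for Fourier index f
        columns.append(np.cos(lam * t))
        columns.append(np.sin(lam * t))
    return np.column_stack(columns)

def fit_harmonics(x, fourier_indices):
    """Least squares estimates (a0, a1, b1, a2, b2, ...) and fitted values."""
    x = np.asarray(x, dtype=float)
    design = harmonic_design(x.size, fourier_indices)
    coef, *_ = np.linalg.lstsq(design, x, rcond=None)
    return coef, design @ coef

# Synthetic monthly series with periods 12 and 6 (illustration only).
rng = np.random.default_rng(4)                # arbitrary seed (assumption)
n = 72
t = np.arange(1, n + 1)
seasonal = 100.0 + 10.0 * np.cos(2 * np.pi * t / 12) + 3.0 * np.sin(2 * np.pi * t / 6)
x = seasonal + rng.normal(0.0, 1.0, size=n)

# k = 2 harmonics with Fourier indices f1 = n/12 = 6 and f2 = n/6 = 12.
coef, fitted = fit_harmonics(x, fourier_indices=[6, 12])
print(coef)
```

The fitted values repeat every 12 observations, as they must for a sum of harmonics whose periods divide 12.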


1.3.3 A General Approach to Time Series Modeling

The  examples  of  the  previous  section  illustrate  a  general  approach  to  time  series 
analysis  that  will  form  the  basis  for  much  of  what  is  done  in  this  book.  Before 
introducing  the  ideas  of  dependence  and  stationarity,  we  outline  this  approach  to 
provide  the  reader  with  an  overview  of  the  way  in  which  the  various  ideas  of  this 
chapter  fit  together. 

•  Plot  the  series  and  examine  the  main  features  of  the  graph,  checking  in  particular 
whether  there  is 

(a)  a  trend, 

(b)  a  seasonal  component, 

(c)  any  apparent  sharp  changes  in  behavior, 

(d)  any  outlying  observations. 

•  Remove  the  trend  and  seasonal  components  to  get  stationary  residuals  (as  defined 
in  Section  1.4).  To  achieve  this  goal  it  may  sometimes  be  necessary  to  apply 
a  preliminary  transformation  to  the  data.  For  example,  if  the  magnitude  of  the 
fluctuations  appears  to  grow  roughly  linearly  with  the  level  of  the  series,  then 
the transformed series $\{\ln X_1, \ldots, \ln X_n\}$ will have fluctuations of more constant
magnitude.  See,  for  example,  Figures  1-1  and  1-17.  (If  some  of  the  data  are 
negative,  add  a  positive  constant  to  each  of  the  data  values  to  ensure  that  all 
values  are  positive  before  taking  logarithms.)  There  are  several  ways  in  which 
trend  and  seasonality  can  be  removed  (see  Section  1.5),  some  involving  estimating 
the  components  and  subtracting  them  from  the  data,  and  others  depending  on 
differencing the data, i.e., replacing the original series $\{X_t\}$ by $\{Y_t := X_t - X_{t-d}\}$
for  some  positive  integer  d.  Whichever  method  is  used,  the  aim  is  to  produce  a 
stationary  series,  whose  values  we  shall  refer  to  as  residuals. 

•  Choose  a  model  to  fit  the  residuals,  making  use  of  various  sample  statistics 
including  the  sample  autocorrelation  function  to  be  defined  in  Section  1.4. 

•  Forecasting will be achieved by forecasting the residuals and then inverting the
transformations described above to arrive at forecasts of the original series $\{X_t\}$.

•  An  extremely  useful  alternative  approach  touched  on  only  briefly  in  this  book  is  to 
express  the  series  in  terms  of  its  Fourier  components,  which  are  sinusoidal  waves 
of  different  frequencies  (cf.  Example  1.1.4).  This  approach  is  especially  important 
in  engineering  applications  such  as  signal  processing  and  structural  design.  It  is 
important,  for  example,  to  ensure  that  the  resonant  frequency  of  a  structure  does 
not  coincide  with  a  frequency  at  which  the  loading  forces  on  the  structure  have  a 
particularly  large  component. 
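
The transformation steps in the outline above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the function names, the `shift` argument, and the small data set are hypothetical and are not part of ITSM. It takes logarithms, applies lag-$d$ differencing, and then inverts both steps, as one would when converting forecasts of the residuals back into forecasts of the original series.

```python
import math

def log_transform(xs, shift=0.0):
    """Return ln(x + shift); shift is a hypothetical positive constant
    that can be added so that every value is positive."""
    return [math.log(x + shift) for x in xs]

def difference(xs, d=1):
    """Lag-d differencing: y_t = x_t - x_{t-d}, defined for t > d."""
    return [xs[t] - xs[t - d] for t in range(d, len(xs))]

def undifference(ys, first_values):
    """Invert lag-d differencing given the first d values of the original series."""
    d = len(first_values)
    xs = list(first_values)
    for y in ys:
        xs.append(y + xs[-d])
    return xs

# Illustrative (made-up) data whose fluctuations grow with the level.
x = [10.0, 12.0, 15.0, 19.0, 25.0, 31.0, 40.0, 52.0]
z = log_transform(x)       # stabilise the magnitude of the fluctuations
y = difference(z, d=1)     # remove the roughly linear trend in ln x

# Inverting the transformations recovers the original series.
x_back = [math.exp(v) for v in undifference(y, z[:1])]
```

The inversion needs the first $d$ values of the undifferenced series, which is why `undifference` takes them as an argument.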

1.4  Stationary Models and the Autocorrelation Function

Loosely speaking, a time series $\{X_t, t = 0, \pm 1, \ldots\}$ is said to be stationary if it has
statistical properties similar to those of the "time-shifted" series $\{X_{t+h}, t = 0, \pm 1, \ldots\}$,
for each integer $h$. Restricting attention to those properties that depend only on the
first- and second-order moments of $\{X_t\}$, we can make this idea precise with the
following definitions.


Definition 1.4.1


Let $\{X_t\}$ be a time series with $E(X_t^2) < \infty$. The mean function of $\{X_t\}$ is

$$\mu_X(t) = E(X_t).$$

The covariance function of $\{X_t\}$ is

$$\gamma_X(r, s) = \operatorname{Cov}(X_r, X_s) = E[(X_r - \mu_X(r))(X_s - \mu_X(s))]$$

for all integers $r$ and $s$.


Definition 1.4.2


$\{X_t\}$ is (weakly) stationary if

(i)  $\mu_X(t)$ is independent of $t$,
and

(ii)  $\gamma_X(t + h, t)$ is independent of $t$ for each $h$.


Remark 1.  Strict stationarity of a time series $\{X_t, t = 0, \pm 1, \ldots\}$ is defined by the
condition that $(X_1, \ldots, X_n)$ and $(X_{1+h}, \ldots, X_{n+h})$ have the same joint distributions for
all integers $h$ and $n > 0$. It is easy to check that if $\{X_t\}$ is strictly stationary and
$EX_t^2 < \infty$ for all $t$, then $\{X_t\}$ is also weakly stationary (Problem 1.3). Whenever we use the
term stationary we shall mean weakly stationary as in Definition 1.4.2, unless we
specifically indicate otherwise.  □


Remark 2.  In view of condition (ii), whenever we use the term covariance function
with reference to a stationary time series $\{X_t\}$ we shall mean the function $\gamma_X$ of one
variable, defined by

$$\gamma_X(h) := \gamma_X(h, 0) = \gamma_X(t + h, t).$$

The function $\gamma_X(\cdot)$ will be referred to as the autocovariance function and $\gamma_X(h)$ as its
value at lag $h$.  □


Definition 1.4.3


Let $\{X_t\}$ be a stationary time series. The autocovariance function (ACVF) of
$\{X_t\}$ at lag $h$ is

$$\gamma_X(h) = \operatorname{Cov}(X_{t+h}, X_t).$$

The autocorrelation function (ACF) of $\{X_t\}$ at lag $h$ is

$$\rho_X(h) = \frac{\gamma_X(h)}{\gamma_X(0)} = \operatorname{Cor}(X_{t+h}, X_t).$$

In the following examples we shall frequently use the easily verified linearity property
of covariances, that if $EX^2 < \infty$, $EY^2 < \infty$, $EZ^2 < \infty$ and $a$, $b$, and $c$ are any
real constants, then

$$\operatorname{Cov}(aX + bY + c, Z) = a \operatorname{Cov}(X, Z) + b \operatorname{Cov}(Y, Z).$$


Example 1.4.1  iid Noise

If $\{X_t\}$ is iid noise and $E(X_t^2) = \sigma^2 < \infty$, then the first requirement of
Definition 1.4.2 is obviously satisfied, since $E(X_t) = 0$ for all $t$. By the assumed
independence,

$$\gamma_X(t + h, t) = \begin{cases} \sigma^2, & \text{if } h = 0, \\ 0, & \text{if } h \neq 0, \end{cases}$$

which does not depend on $t$. Hence iid noise with finite second moment is stationary.
We shall use the notation

$$\{X_t\} \sim \mathrm{IID}(0, \sigma^2)$$

to indicate that the random variables $X_t$ are independent and identically distributed
random variables, each with mean 0 and variance $\sigma^2$.

□


Example 1.4.2  White Noise

If $\{X_t\}$ is a sequence of uncorrelated random variables, each with zero mean and
variance $\sigma^2$, then clearly $\{X_t\}$ is stationary with the same covariance function as the
iid noise in Example 1.4.1. Such a sequence is referred to as white noise (with mean
0 and variance $\sigma^2$). This is indicated by the notation

$$\{X_t\} \sim \mathrm{WN}(0, \sigma^2).$$

Clearly, every $\mathrm{IID}(0, \sigma^2)$ sequence is $\mathrm{WN}(0, \sigma^2)$ but not conversely (see Problem 1.8
and the ARCH(1) process of Section 11.3).

□


Example 1.4.3  The Random Walk

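If $\{S_t\}$ is the random walk defined in Example 1.3.3 with $\{X_t\}$ as in Example 1.4.1,
then $ES_t = 0$, $E(S_t^2) = t\sigma^2 < \infty$ for all $t$, and, for $h \geq 0$,

$$\begin{aligned}
\gamma_S(t + h, t) &= \operatorname{Cov}(S_{t+h}, S_t) \\
&= \operatorname{Cov}(S_t + X_{t+1} + \cdots + X_{t+h}, S_t) \\
&= \operatorname{Cov}(S_t, S_t) \\
&= t\sigma^2.
\end{aligned}$$

Since $\gamma_S(t + h, t)$ depends on $t$, the series $\{S_t\}$ is not stationary.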

□ 
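
The dependence of the random walk covariance on $t$ can be checked numerically. The following is a small Monte Carlo sketch, not a proof. It assumes Gaussian steps, and the function name, number of replications, and seed are our own choices.

```python
import random

def random_walk_cov(t, h, sigma=1.0, reps=20000, seed=0):
    """Monte Carlo estimate of Cov(S_{t+h}, S_t) for S_t = X_1 + ... + X_t
    with X_i iid N(0, sigma^2). Both means are zero, so the covariance is
    estimated by the average of the products S_{t+h} * S_t."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        s_t = sum(rng.gauss(0.0, sigma) for _ in range(t))
        s_th = s_t + sum(rng.gauss(0.0, sigma) for _ in range(h))
        total += s_th * s_t
    return total / reps

# The theoretical values are t * sigma^2, i.e. 5 and 20, whatever the lag h.
est_5 = random_walk_cov(t=5, h=3)
est_20 = random_walk_cov(t=20, h=3)
```

The two estimates differ by roughly a factor of four even though the lag is the same, which is exactly the failure of condition (ii) of Definition 1.4.2.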



Example 1.4.4  First-Order Moving Average or MA(1) Process

Consider the series defined by the equation

$$X_t = Z_t + \theta Z_{t-1}, \quad t = 0, \pm 1, \ldots, \tag{1.4.1}$$

where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$ and $\theta$ is a real-valued constant. From (1.4.1) we see that
$EX_t = 0$, $EX_t^2 = \sigma^2(1 + \theta^2) < \infty$, and

$$\gamma_X(t + h, t) = \begin{cases} \sigma^2(1 + \theta^2), & \text{if } h = 0, \\ \sigma^2\theta, & \text{if } h = \pm 1, \\ 0, & \text{if } |h| > 1. \end{cases}$$

Thus the requirements of Definition 1.4.2 are satisfied, and $\{X_t\}$ is stationary. The
autocorrelation function of $\{X_t\}$ is

$$\rho_X(h) = \begin{cases} 1, & \text{if } h = 0, \\ \theta/(1 + \theta^2), & \text{if } h = \pm 1, \\ 0, & \text{if } |h| > 1. \end{cases} \tag{1.4.2}$$

□


Example 1.4.5  First-Order Autoregression or AR(1) Process

Let us assume now that $\{X_t\}$ is a stationary series satisfying the equations

$$X_t = \phi X_{t-1} + Z_t, \quad t = 0, \pm 1, \ldots, \tag{1.4.3}$$

where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$, $|\phi| < 1$, and $Z_t$ is uncorrelated with $X_s$ for each $s < t$. (We
shall show in Section 2.2 that there is in fact exactly one such solution of (1.4.3).) By
taking expectations on each side of (1.4.3) and using the fact that $EZ_t = 0$, we see at
once that

$$EX_t = 0.$$

To find the autocorrelation function of $\{X_t\}$ we multiply each side of (1.4.3) by $X_{t-h}$
($h > 0$) and then take expectations to get

$$\begin{aligned}
\gamma_X(h) &= \operatorname{Cov}(X_t, X_{t-h}) \\
&= \operatorname{Cov}(\phi X_{t-1}, X_{t-h}) + \operatorname{Cov}(Z_t, X_{t-h}) \\
&= \phi\gamma_X(h - 1) + 0 = \cdots = \phi^h\gamma_X(0).
\end{aligned}$$

Observing that $\gamma(h) = \gamma(-h)$ and using Definition 1.4.3, we find that

$$\rho_X(h) = \frac{\gamma_X(h)}{\gamma_X(0)} = \phi^{|h|}, \quad h = 0, \pm 1, \ldots.$$

It follows from the linearity of the covariance function in each of its arguments and
the fact that $Z_t$ is uncorrelated with $X_{t-1}$ that

$$\gamma_X(0) = \operatorname{Cov}(X_t, X_t) = \operatorname{Cov}(\phi X_{t-1} + Z_t, \phi X_{t-1} + Z_t) = \phi^2\gamma_X(0) + \sigma^2$$

and hence that $\gamma_X(0) = \sigma^2/(1 - \phi^2)$.


□ 
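
The closed-form results of the MA(1) and AR(1) examples translate directly into code. The sketch below simply evaluates the formulas derived above; the function names are our own.

```python
def ma1_acf(theta, h):
    """ACF of X_t = Z_t + theta * Z_{t-1}, as in equation (1.4.2)."""
    h = abs(h)
    if h == 0:
        return 1.0
    if h == 1:
        return theta / (1.0 + theta ** 2)
    return 0.0

def ar1_acvf(phi, sigma2, h):
    """ACVF of the stationary AR(1) process:
    gamma(h) = sigma^2 * phi^|h| / (1 - phi^2), valid for |phi| < 1."""
    if abs(phi) >= 1:
        raise ValueError("the derivation assumes |phi| < 1")
    return sigma2 * phi ** abs(h) / (1.0 - phi ** 2)

def ar1_acf(phi, h):
    """ACF of the stationary AR(1) process: rho(h) = phi^|h|."""
    return ar1_acvf(phi, 1.0, h) / ar1_acvf(phi, 1.0, 0)

# gamma(0) satisfies the fixed-point relation gamma(0) = phi^2 * gamma(0) + sigma^2.
g0 = ar1_acvf(0.5, 2.0, 0)
```

Note that the MA(1) autocorrelations vanish beyond lag 1, whereas the AR(1) autocorrelations decay geometrically and never vanish exactly when $\phi \neq 0$.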

1.4.1  The Sample Autocorrelation Function

Although we have just seen how to compute the autocorrelation function for a few
simple time series models, in practical problems we do not start with a model, but
with observed data $\{x_1, x_2, \ldots, x_n\}$. To assess the degree of dependence in the data
and to select a model for the data that reflects this, one of the important tools we use
is the sample autocorrelation function (sample ACF) of the data. If we believe that
the data are realized values of a stationary time series $\{X_t\}$, then the sample ACF will
provide us with an estimate of the ACF of $\{X_t\}$. This estimate may suggest which of
the many possible stationary time series models is a suitable candidate for representing
the dependence in the data. For example, a sample ACF that is close to zero for all
nonzero lags suggests that an appropriate model for the data might be iid noise. The
following definitions are natural sample analogues of those for the autocovariance and
autocorrelation functions given earlier for stationary time series models.


Definition 1.4.4

Let $x_1, \ldots, x_n$ be observations of a time series. The sample mean of $x_1, \ldots, x_n$ is

$$\bar{x} = \frac{1}{n}\sum_{t=1}^{n} x_t.$$

The sample autocovariance function is

$$\hat{\gamma}(h) := n^{-1}\sum_{t=1}^{n-|h|} (x_{t+|h|} - \bar{x})(x_t - \bar{x}), \quad -n < h < n.$$

The sample autocorrelation function is

$$\hat{\rho}(h) = \frac{\hat{\gamma}(h)}{\hat{\gamma}(0)}, \quad -n < h < n.$$


Remark 3.  For $h \geq 0$, $\hat{\gamma}(h)$ is approximately equal to the sample covariance of the $n - h$
pairs of observations $(x_1, x_{1+h}), (x_2, x_{2+h}), \ldots, (x_{n-h}, x_n)$. The difference arises from
use of the divisor $n$ instead of $n - h$ and the subtraction of the overall mean, $\bar{x}$, from
each factor of the summands. Use of the divisor $n$ ensures that the sample covariance
matrix $\hat{\Gamma}_n := [\hat{\gamma}(i - j)]_{i,j=1}^{n}$ is nonnegative definite (see Section 2.4.2).

Remark 4.  Like the sample covariance matrix defined in Remark 3, the sample
correlation matrix $\hat{R}_n := [\hat{\rho}(i - j)]_{i,j=1}^{n}$ is nonnegative definite. Each of its diagonal
elements is equal to 1, since $\hat{\rho}(0) = 1$.  □


Example 1.4.6


Figure 1-12 shows a simulated sequence of 200 iid normal random variables with
mean 0 and variance 1 (called an IID N(0, 1) sequence). Figure 1-13 shows the
corresponding sample autocorrelation function at lags 0, 1, ..., 40. Since $\rho(h) = 0$ for
$h > 0$, one would also expect the corresponding sample autocorrelations to be near 0. It
can be shown, in fact, that for iid noise with finite variance, the sample autocorrelations
$\hat{\rho}(h)$, $h > 0$, are approximately IID N(0, $1/n$) for $n$ large (see Brockwell and
Davis (1991) p. 222). Hence, approximately 95% of the sample autocorrelations
should fall between the bounds $\pm 1.96/\sqrt{n}$ (since 1.96 is the 0.975 quantile of the
standard normal distribution). Therefore, in Figure 1-13 we would expect roughly
40(0.05) = 2 values to fall outside the bounds. To simulate IID N(0, 1) noise in
ITSM, select File>Project>New>Univariate then Model>Simulate. In
the resulting dialog box, enter 200 for the required Number of Observations.
(The remaining entries in the dialog box can be left as they are, since the model
assumed by ITSM, until you enter another, is IID N(0, 1) noise. If you wish to
reproduce exactly the same sequence at a later date, record the Random Number
Seed for later use. By specifying different values for the random number seed you can
generate independent realizations of your time series.) Click on OK and you will see the
graph of your simulated series. To see its sample autocorrelation function together with
the autocorrelation function of the model that generated it, click on the third yellow
button at the top of the screen and you will see the two graphs superimposed (with the
latter in red). The horizontal lines on the graph are the bounds $\pm 1.96/\sqrt{n}$.

Figure 1-12  200 simulated values of IID N(0,1) noise

Figure 1-13  The sample autocorrelation function for the data of Figure 1-12 showing the bounds $\pm 1.96/\sqrt{n}$


□ 
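
For readers working outside ITSM, the sample ACF of Definition 1.4.4 and the bounds $\pm 1.96/\sqrt{n}$ can be sketched as follows. This is our own minimal Python illustration; the function names and the random seed are arbitrary, and the simulated series will not coincide with the one plotted in Figure 1-12.

```python
import math
import random

def sample_acvf(xs, h):
    """Sample autocovariance at lag h, with divisor n as in Definition 1.4.4."""
    n = len(xs)
    h = abs(h)
    if h >= n:
        raise ValueError("lag must satisfy -n < h < n")
    xbar = sum(xs) / n
    return sum((xs[t + h] - xbar) * (xs[t] - xbar) for t in range(n - h)) / n

def sample_acf(xs, h):
    """Sample autocorrelation at lag h."""
    return sample_acvf(xs, h) / sample_acvf(xs, 0)

# Simulate 200 iid N(0, 1) values and count the lags 1, ..., 40 at which
# the sample ACF falls outside the bounds +-1.96 / sqrt(n).
rng = random.Random(1)
noise = [rng.gauss(0.0, 1.0) for _ in range(200)]
bound = 1.96 / math.sqrt(len(noise))
outside = [h for h in range(1, 41) if abs(sample_acf(noise, h)) > bound]
```

For iid noise one expects about two of the forty lags to appear in `outside`; a much larger count would cast doubt on the iid hypothesis.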

Figure 1-14  The sample autocorrelation function for the Australian red wine sales showing the bounds $\pm 1.96/\sqrt{n}$


Remark 5.  The sample autocovariance and autocorrelation functions can be computed
for any data set $\{x_1, \ldots, x_n\}$ and are not restricted to observations from a
stationary time series. For data containing a trend, $|\hat{\rho}(h)|$ will exhibit slow decay as
$h$ increases, and for data with a substantial deterministic periodic component, $|\hat{\rho}(h)|$
will exhibit similar behavior with the same periodicity. (See the sample ACF of the
Australian red wine sales in Figure 1-14 and Problem 1.9.) Thus $\hat{\rho}(\cdot)$ can be useful as
an indicator of nonstationarity (see also Section 6.1).  □


1.4.2  A Model for the Lake Huron Data

As noted earlier, an iid noise model for the residuals $\{y_1, \ldots, y_{98}\}$ obtained by fitting
a straight line to the Lake Huron data in Example 1.3.5 appears to be inappropriate.
This conclusion is confirmed by the sample ACF of the residuals (Figure 1-15), which
has three of the first 40 values well outside the bounds $\pm 1.96/\sqrt{98}$.

The roughly geometric decay of the first few sample autocorrelations (with
$\hat{\rho}(h + 1)/\hat{\rho}(h) \approx 0.7$) suggests that an AR(1) series (with $\phi \approx 0.7$) might provide
a reasonable model for these residuals. (The form of the ACF for an AR(1) process
was computed in Example 1.4.5.)

To explore the appropriateness of such a model, consider the points $(y_1, y_2)$,
$(y_2, y_3), \ldots, (y_{97}, y_{98})$ plotted in Figure 1-16. The graph does indeed suggest a linear
relationship between $y_t$ and $y_{t-1}$. Using simple least squares estimation to fit a straight
line of the form $y_t = a y_{t-1}$, we obtain the model

$$Y_t = 0.791 Y_{t-1} + Z_t, \tag{1.4.4}$$

where $\{Z_t\}$ is iid noise with variance $\sum_{t=2}^{98} (y_t - 0.791 y_{t-1})^2/97 = 0.5024$. The sample
ACF of the estimated noise sequence $z_t = y_t - 0.791 y_{t-1}$, $t = 2, \ldots, 98$, is slightly
outside the bounds $\pm 1.96/\sqrt{97}$ at lag 1 ($\hat{\rho}(1) = 0.216$), but it is inside the bounds for
all other lags up to 40. This check that the estimated noise sequence is consistent with
the iid assumption of (1.4.4) reinforces our belief in the fitted model. More goodness
of fit tests for iid noise sequences are described in Section 1.6. The estimated noise
sequence $\{z_t\}$ in this example passes them all, providing further support for the model
(1.4.4).

Figure 1-15  The sample autocorrelation function for the Lake Huron residuals of Figure 1-10 showing the bounds $\pm 1.96/\sqrt{n}$

Figure 1-16  Scatter plot of $(y_{t-1}, y_t)$, $t = 2, \ldots, 98$, for the data in Figure 1-10 showing the least squares regression line $y = 0.791x$

A better fit to the residuals in equation (1.3.2) is provided by the second-order
autoregression

$$Y_t = \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + Z_t, \tag{1.4.5}$$

where $\{Z_t\}$ is iid noise with variance $\sigma^2$. This is analogous to a linear model in which $Y_t$
is regressed on the previous two values $Y_{t-1}$ and $Y_{t-2}$ of the time series. The least
squares estimates of the parameters $\phi_1$ and $\phi_2$, found by minimizing
$\sum_{t=3}^{98} (y_t - \phi_1 y_{t-1} - \phi_2 y_{t-2})^2$, are $\hat{\phi}_1 = 1.002$ and $\hat{\phi}_2 = -0.2834$. The estimate of $\sigma^2$ is
$\hat{\sigma}^2 = \sum_{t=3}^{98} (y_t - \hat{\phi}_1 y_{t-1} - \hat{\phi}_2 y_{t-2})^2/96 = 0.4460$, which is approximately 11% smaller
than the estimate of the noise variance for the AR(1) model (1.4.4). The improved fit
is indicated by the sample ACF of the estimated residuals, $y_t - \hat{\phi}_1 y_{t-1} - \hat{\phi}_2 y_{t-2}$, which
falls well within the bounds $\pm 1.96/\sqrt{96}$ for all lags up to 40.
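
The least squares fits used in this subsection are easy to reproduce in outline. The sketch below is our own illustration, not ITSM's routine. Because the Lake Huron residuals are not listed here, it is applied to a simulated AR(1) series with a hypothetical coefficient 0.8. The divisors $n - 1$ and $n - 2$ mirror the divisors 97 and 96 used above.

```python
import random

def fit_ar1(y):
    """Least squares fit of y_t = a * y_{t-1} (no intercept).
    Returns (a_hat, noise variance estimate with divisor n - 1)."""
    num = sum(y[t] * y[t - 1] for t in range(1, len(y)))
    den = sum(y[t - 1] ** 2 for t in range(1, len(y)))
    a = num / den
    rss = sum((y[t] - a * y[t - 1]) ** 2 for t in range(1, len(y)))
    return a, rss / (len(y) - 1)

def fit_ar2(y):
    """Least squares fit of y_t = p1 * y_{t-1} + p2 * y_{t-2}, solving the
    2x2 normal equations by Cramer's rule. Variance divisor is n - 2."""
    idx = range(2, len(y))
    s11 = sum(y[t - 1] ** 2 for t in idx)
    s22 = sum(y[t - 2] ** 2 for t in idx)
    s12 = sum(y[t - 1] * y[t - 2] for t in idx)
    b1 = sum(y[t] * y[t - 1] for t in idx)
    b2 = sum(y[t] * y[t - 2] for t in idx)
    det = s11 * s22 - s12 ** 2
    p1 = (b1 * s22 - b2 * s12) / det
    p2 = (s11 * b2 - s12 * b1) / det
    rss = sum((y[t] - p1 * y[t - 1] - p2 * y[t - 2]) ** 2 for t in idx)
    return p1, p2, rss / (len(y) - 2)

# Simulated AR(1) data (hypothetical phi = 0.8, unit noise variance).
rng = random.Random(42)
y = [0.0]
for _ in range(5000):
    y.append(0.8 * y[-1] + rng.gauss(0.0, 1.0))
a_hat, s2_hat = fit_ar1(y)
```

Comparing the variance estimates returned by `fit_ar1` and `fit_ar2` on the same data is the kind of comparison made above between 0.5024 and 0.4460.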

1.5  Estimation and Elimination of Trend and Seasonal Components

The  first  step  in  the  analysis  of  any  time  series  is  to  plot  the  data.  If  there  are  any 
apparent  discontinuities  in  the  series,  such  as  a  sudden  change  of  level,  it  may  be 
advisable  to  analyze  the  series  by  first  breaking  it  into  homogeneous  segments.  If 
there  are  outlying  observations,  they  should  be  studied  carefully  to  check  whether 
there  is  any  justification  for  discarding  them  (as  for  example  if  an  observation  has 
been  incorrectly  recorded).  Inspection  of  a  graph  may  also  suggest  the  possibility 
of  representing  the  data  as  a  realization  of  the  process  (the  classical  decomposition 
model) 

$$X_t = m_t + s_t + Y_t, \tag{1.5.1}$$

where $m_t$ is a slowly changing function known as a trend component, $s_t$ is a function
with known period $d$ referred to as a seasonal component, and $Y_t$ is a random noise
component that is stationary in the sense of Definition 1.4.2. If the seasonal and noise
fluctuations appear to increase with the level of the process, then a preliminary
transformation of the data is often used to make the transformed data more compatible
with the model (1.5.1). Compare, for example, the red wine sales in Figure 1-1 with
the transformed data, Figure 1-17, obtained by applying a logarithmic transformation.
The transformed data do not exhibit the increasing fluctuation with increasing level
that was apparent in the original data. This suggests that the model (1.5.1) is more
appropriate for the transformed than for the original series. In this section we shall
assume that the model (1.5.1) is appropriate (possibly after a preliminary transformation
of the data) and examine some techniques for estimating the components $m_t$, $s_t$,
and $Y_t$ in the model.

Our aim is to estimate and extract the deterministic components $m_t$ and $s_t$ in the
hope that the residual or noise component $Y_t$ will turn out to be a stationary time series.
We can then use the theory of such processes to find a satisfactory probabilistic model
for the process $Y_t$, to analyze its properties, and to use it in conjunction with $m_t$ and $s_t$
for purposes of prediction and simulation of $\{X_t\}$.

Another approach, developed extensively by Box and Jenkins (1976), is to apply
differencing operators repeatedly to the series $\{X_t\}$ until the differenced observations
resemble a realization of some stationary time series $\{W_t\}$. We can then use the theory
of stationary processes for the modeling, analysis, and prediction of $\{W_t\}$ and hence
of the original process. The various stages of this procedure will be discussed in detail
in Chapters 5 and 6.

Figure 1-17  The natural logarithms of the red wine data

The two approaches to trend and seasonality removal, (1) by estimation of $m_t$ and $s_t$
in (1.5.1) and (2) by differencing the series $\{X_t\}$, will now be illustrated with reference
to the data introduced in Section 1.1.


1.5.1  Estimation and Elimination of Trend in the Absence of Seasonality

In the absence of a seasonal component the model (1.5.1) becomes the following.


Nonseasonal Model with Trend:

$$X_t = m_t + Y_t, \quad t = 1, \ldots, n, \tag{1.5.2}$$

where $EY_t = 0$.


(If $EY_t \neq 0$, then we can replace $m_t$ and $Y_t$ in (1.5.2) with $m_t + EY_t$ and $Y_t - EY_t$,
respectively.)

Method  1:  Trend  Estimation 

Moving  average  and  spectral  smoothing  are  essentially  nonparametric  methods  for 
trend  (or  signal)  estimation  and  not  for  model  building.  Special  smoothing  filters  can 
also be designed to remove periodic components as described under Method S1 below.
The  choice  of  smoothing  filter  requires  a  certain  amount  of  subjective  judgment,  and 
it  is  recommended  that  a  variety  of  filters  be  tried  in  order  to  get  a  good  idea  of  the 
underlying  trend.  Exponential  smoothing,  since  it  is  based  on  a  moving  average  of  past 
values  only,  is  often  used  for  forecasting,  the  smoothed  value  at  the  present  time  being 
used  as  the  forecast  of  the  next  value. 

To  construct  a  model  for  the  data  (with  no  seasonality)  there  are  two 
general  approaches,  both  available  in  ITSM.  One  is  to  fit  a  polynomial  trend 
(by  least  squares)  as  described  in  Method  1(d)  below,  then  to  subtract  the  fitted  trend 
from  the  data  and  to  find  an  appropriate  stationary  time  series  model  for  the  residuals. 
The  other  is  to  eliminate  the  trend  by  differencing  as  described  in  Method  2  and  then  to 
find  an  appropriate  stationary  model  for  the  differenced  series.  The  latter  method  has 
the  advantage  that  it  usually  requires  the  estimation  of  fewer  parameters  and  does  not 
rest  on  the  assumption  of  a  trend  that  remains  fixed  throughout  the  observation  period. 
The  study  of  the  residuals  (or  of  the  differenced  series)  is  taken  up  in  Section  1.6. 

(a)  Smoothing with a finite moving average filter. Let $q$ be a nonnegative integer
and consider the two-sided moving average

$$W_t = (2q + 1)^{-1}\sum_{j=-q}^{q} X_{t-j} \tag{1.5.3}$$

of the process $\{X_t\}$ defined by (1.5.2). Then for $q + 1 \le t \le n - q$,

$$W_t = (2q + 1)^{-1}\sum_{j=-q}^{q} m_{t-j} + (2q + 1)^{-1}\sum_{j=-q}^{q} Y_{t-j} \approx m_t, \tag{1.5.4}$$

assuming that $m_t$ is approximately linear over the interval $[t - q, t + q]$ and that the
average of the error terms over this interval is close to zero (see Problem 1.11).

Figure 1-18  Simple 5-term moving average $\hat{m}_t$ of the strike data from Figure 1-6

The moving average thus provides us with the estimates

$$\hat{m}_t = (2q + 1)^{-1}\sum_{j=-q}^{q} X_{t-j}, \quad q + 1 \le t \le n - q. \tag{1.5.5}$$

Since $X_t$ is not observed for $t \le 0$ or $t > n$, we cannot use (1.5.5) for $t \le q$ or
$t > n - q$. The program ITSM deals with this problem by defining $X_t := X_1$ for
$t < 1$ and $X_t := X_n$ for $t > n$.

Example 1.5.1

The result of applying the moving-average filter (1.5.5) with $q = 2$ to the strike data
of Figure 1-6 is shown in Figure 1-18. The estimated noise terms $\hat{Y}_t = X_t - \hat{m}_t$ are
shown in Figure 1-19. As expected, they show no apparent trend. To apply this filter
using ITSM, open the project STRIKES.TSM, select Smooth>Moving Average,
specify 2 for the filter order, and enter the weights 1,1,1 for Theta(0), Theta(1),
and Theta(2) (these are automatically normalized so that the sum of the weights is
one). Then click OK.

□ 
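
A minimal Python sketch of the equal-weight moving average (1.5.5) follows. It uses the endpoint convention attributed to ITSM above (extending the series by its first and last values); the function names are our own and the code is not taken from ITSM.

```python
def moving_average(xs, q):
    """Two-sided moving average with 2q + 1 equal weights, as in (1.5.5).
    Endpoints are handled by setting X_t = X_1 for t < 1 and X_t = X_n for t > n."""
    n = len(xs)
    padded = [xs[0]] * q + list(xs) + [xs[-1]] * q
    # The window for original index i occupies padded[i : i + 2q + 1].
    return [sum(padded[i:i + 2 * q + 1]) / (2 * q + 1) for i in range(n)]

def detrend(xs, q):
    """Estimated noise terms: x_t minus the moving-average trend estimate."""
    m_hat = moving_average(xs, q)
    return [x - m for x, m in zip(xs, m_hat)]

# A purely linear series passes through the filter unchanged away from the ends.
linear = [2 + 3 * t for t in range(10)]
smooth = moving_average(linear, q=2)
```

With `q=2` this is the 5-term filter of Example 1.5.1. Only the first and last `q` values are affected by the padding.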

It is useful to think of $\{\hat{m}_t\}$ in (1.5.5) as a process obtained from $\{X_t\}$ by application
of a linear operator or linear filter $\hat{m}_t = \sum_{j=-\infty}^{\infty} a_j X_{t-j}$ with weights
$a_j = (2q + 1)^{-1}$, $-q \le j \le q$. This particular filter is a low-pass filter in the sense that it takes the
data $\{X_t\}$ and removes from it the rapidly fluctuating (or high frequency) component
$\{\hat{Y}_t\}$ to leave the slowly varying estimated trend term $\{\hat{m}_t\}$ (see Figure 1-20).

Figure 1-19  Residuals $\hat{Y}_t = X_t - \hat{m}_t$ after subtracting the 5-term moving average from the strike data

Figure 1-20  Smoothing with a low-pass linear filter

The particular filter (1.5.5) is only one of many that could be used for smoothing.
For large $q$, provided that $(2q + 1)^{-1}\sum_{j=-q}^{q} Y_{t-j} \approx 0$, it not only will attenuate noise
but at the same time will allow linear trend functions $m_t = c_0 + c_1 t$ to pass without
distortion (see Problem 1.11). However, we must beware of choosing $q$ to be too large,
since if $m_t$ is not linear, the filtered process, although smooth, will not be a good
estimate of $m_t$. By clever choice of the weights $\{a_j\}$ it is possible (see Problems 1.12-1.14
and Section 4.3) to design a filter that will not only be effective in attenuating
noise in the data, but that will also allow a larger class of trend functions (for example
all polynomials of degree less than or equal to 3) to pass through without distortion.
The Spencer 15-point moving average is a filter that passes polynomials of degree 3
without distortion. Its weights are

$$a_j = 0, \quad |j| > 7,$$

with

$$a_j = a_{-j}, \quad |j| \le 7,$$

and

$$[a_0, a_1, \ldots, a_7] = \frac{1}{320}[74, 67, 46, 21, 3, -5, -6, -3].$$

Applied to the process (1.5.2) with $m_t = c_0 + c_1 t + c_2 t^2 + c_3 t^3$, it gives

$$\sum_{j=-7}^{7} a_j X_{t-j} = \sum_{j=-7}^{7} a_j m_{t-j} + \sum_{j=-7}^{7} a_j Y_{t-j} \approx \sum_{j=-7}^{7} a_j m_{t-j} = m_t, \tag{1.5.6}$$

where the last step depends on the assumed form of $m_t$ (Problem 1.12). Further details
regarding this and other smoothing filters can be found in Kendall and Stuart (1976,
Chapter 46).
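
The claim that the Spencer filter passes cubic polynomials without distortion can be verified numerically. The following sketch is our own illustration: the helper names and the particular cubic are arbitrary choices, and only interior points are computed because no endpoint convention is applied.

```python
SPENCER_HALF = [74, 67, 46, 21, 3, -5, -6, -3]   # 320 * (a_0, ..., a_7)

def spencer_weights():
    """Return [a_{-7}, ..., a_7], using the symmetry a_j = a_{-j}."""
    right = [w / 320 for w in SPENCER_HALF]
    return right[:0:-1] + right

def apply_filter(xs, weights):
    """Apply a symmetric filter sum_j a_j x_{t-j}; only the interior points
    (where the full window is available) are returned."""
    q = len(weights) // 2
    return [sum(w * xs[t - j] for w, j in zip(weights, range(-q, q + 1)))
            for t in range(q, len(xs) - q)]

# An arbitrary cubic trend with no noise.
cubic = [1 + 2 * t - 0.5 * t ** 2 + 0.1 * t ** 3 for t in range(30)]
filtered = apply_filter(cubic, spencer_weights())
```

The result agrees with `cubic[7:23]` up to rounding error. The check works because the weights sum to one and satisfy $\sum_j j^2 a_j = 0$, while the odd moments vanish by symmetry.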

(b)  Exponential smoothing. For any fixed $\alpha \in [0, 1]$, the one-sided moving
averages $\hat{m}_t$, $t = 1, \ldots, n$, defined by the recursions

$$\hat{m}_t = \alpha X_t + (1 - \alpha)\hat{m}_{t-1}, \quad t = 2, \ldots, n, \tag{1.5.7}$$

and

$$\hat{m}_1 = X_1 \tag{1.5.8}$$

can be computed using ITSM by selecting Smooth>Exponential and specifying
the value of $\alpha$. Application of (1.5.7) and (1.5.8) is often referred to as exponential
smoothing, since the recursions imply that for $t \geq 2$,
$\hat{m}_t = \sum_{j=0}^{t-2} \alpha(1 - \alpha)^j X_{t-j} + (1 - \alpha)^{t-1} X_1$,
a weighted moving average of $X_t, X_{t-1}, \ldots$, with weights decreasing
exponentially (except for the last one).
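
The recursions (1.5.7) and (1.5.8) and their weighted-average form can be compared directly. The sketch below is our own illustration with made-up data; the function names are hypothetical.

```python
def exp_smooth(xs, alpha):
    """Recursions (1.5.7)-(1.5.8): m_1 = x_1 and
    m_t = alpha * x_t + (1 - alpha) * m_{t-1} for t >= 2."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    m = [xs[0]]
    for x in xs[1:]:
        m.append(alpha * x + (1.0 - alpha) * m[-1])
    return m

def exp_smooth_closed_form(xs, alpha, t):
    """Weighted-average form of m_t for t >= 2 (t is 1-based, as in the text)."""
    head = sum(alpha * (1.0 - alpha) ** j * xs[t - 1 - j] for j in range(t - 1))
    return head + (1.0 - alpha) ** (t - 1) * xs[0]

data = [3.0, 5.0, 4.0, 8.0, 6.0]      # illustrative values only
m = exp_smooth(data, 0.4)
```

Taking $\alpha = 1$ returns the data unchanged, while $\alpha = 0$ holds the smoothed value at $X_1$. Values in between trade responsiveness against smoothness.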

(c)  Smoothing by elimination of high-frequency components. The option
Smooth>FFT in the program ITSM allows us to smooth an arbitrary series
by elimination of the high-frequency components of its Fourier series expansion
(see Section 4.2). This option was used in Example 1.1.4, where we chose to retain
the fraction $f = 0.035$ of the frequency components of the series in order to estimate
the underlying signal. (The choice $f = 1$ would have left the series unchanged.)

Figure 1-21  Exponentially smoothed strike data with $\alpha = 0.4$

Figure 1-22  Strike data smoothed by elimination of high frequencies with $f = 0.4$

Example 1.5.2

In Figures 1-21 and 1-22 we show the results of smoothing the strike data by
exponential smoothing with parameter $\alpha = 0.4$ [see (1.5.7)] and by high-frequency
elimination with $f = 0.4$, i.e., by eliminating a fraction 0.6 of the Fourier components
at the top of the frequency range. These should be compared with the simple 5-term
moving average smoothing shown in Figure 1-18. Experimentation with different
smoothing parameters can easily be carried out using the program ITSM. The
exponentially smoothed value of the last observation is frequently used to forecast the next
data value. The program automatically selects an optimal value of $\alpha$ for this purpose
if $\alpha$ is specified as $-1$ in the exponential smoothing dialog box.

□ 
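
Smoothing by discarding high-frequency Fourier components can be sketched with a direct discrete Fourier transform. The cutoff rule used here (retain the components whose frequency index is at most $f$ times the highest index) is our own assumption and may differ from the rule implemented in ITSM. The function names and the test series are likewise hypothetical.

```python
import cmath

def dft(xs):
    """Direct O(n^2) discrete Fourier transform (adequate for short series)."""
    n = len(xs)
    return [sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(xs))
            for k in range(n)]

def lowpass_smooth(xs, f):
    """Keep only the Fourier components whose frequency index min(k, n - k)
    is at most f * (n // 2). This cutoff rule is an assumption, not ITSM's
    documented behaviour."""
    n = len(xs)
    coeffs = dft(xs)
    cutoff = f * (n // 2)
    kept = [c if min(k, n - k) <= cutoff else 0.0 for k, c in enumerate(coeffs)]
    # Inverse DFT; the retained set is symmetric in k, so the result is real.
    return [sum(c * cmath.exp(2j * cmath.pi * k * t / n)
                for k, c in enumerate(kept)).real / n
            for t in range(n)]

series = [float((7 * t) % 5) + 0.1 * t for t in range(16)]   # made-up data
unchanged = lowpass_smooth(series, 1.0)
flat = lowpass_smooth(series, 0.0)
```

Under this rule $f = 1$ reproduces the series, in line with the remark above, while $f = 0$ leaves only the zero-frequency term, that is, the sample mean.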

(d)  Polynomial fitting. In Section 1.3.2 we showed how a trend of the form
$m_t = a_0 + a_1 t + a_2 t^2$ can be fitted to the data $\{x_1, \ldots, x_n\}$ by choosing the parameters
$a_0$, $a_1$, and $a_2$ to minimize the sum of squares, $\sum_{t=1}^{n} (x_t - m_t)^2$ (see Example 1.3.4).
The method of least squares estimation can also be used to estimate higher-order
polynomial trends in the same way. The Regression option of ITSM allows least
squares fitting of polynomial trends of order up to 10 (together with up to four
harmonic terms; see Example 1.3.6). It also allows generalized least squares estimation
(see Section 6.6), in which correlation between the residuals is taken into account.
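
Ordinary least squares fitting of a polynomial trend can be sketched by solving the normal equations. The code below is a plain-Python illustration under our own naming. It is not ITSM's Regression option and does not implement generalized least squares. For high polynomial orders the normal equations become ill-conditioned, so a numerically more careful method would be preferable in practice.

```python
def polyfit_ls(xs, degree):
    """Least squares polynomial trend a_0 + a_1 t + ... + a_p t^p fitted to
    x_1, ..., x_n observed at t = 1, ..., n, via the normal equations."""
    n, p = len(xs), degree + 1
    ts = range(1, n + 1)
    # Augmented system [A | b] with A[i][j] = sum t^(i+j) and b[i] = sum t^i x_t.
    aug = [[float(sum(t ** (i + j) for t in ts)) for j in range(p)]
           + [float(sum(t ** i * x for t, x in zip(ts, xs)))] for i in range(p)]
    for col in range(p):                      # elimination with partial pivoting
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, p):
            factor = aug[r][col] / aug[col][col]
            aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    coeffs = [0.0] * p
    for i in reversed(range(p)):              # back substitution
        tail = sum(aug[i][j] * coeffs[j] for j in range(i + 1, p))
        coeffs[i] = (aug[i][p] - tail) / aug[i][i]
    return coeffs

# Noise-free quadratic data are recovered up to rounding error.
quad = [2.0 + 0.5 * t - 0.1 * t * t for t in range(1, 13)]
a0, a1, a2 = polyfit_ls(quad, 2)
```

Subtracting the fitted values $\hat{a}_0 + \hat{a}_1 t + \hat{a}_2 t^2$ from the data then gives the residuals to which a stationary model is fitted.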

1.5.1.1  Method 2: Trend Elimination by Differencing

Instead of attempting to remove the noise by smoothing as in Method 1, we now
attempt to eliminate the trend term by differencing. We define the lag-1 difference
operator $\nabla$ by

$$\nabla X_t = X_t - X_{t-1} = (1 - B)X_t, \tag{1.5.9}$$

where $B$ is the backward shift operator,

$$BX_t = X_{t-1}. \tag{1.5.10}$$

Powers of the operators $B$ and $\nabla$ are defined in the obvious way, i.e., $B^j(X_t) = X_{t-j}$
and $\nabla^j(X_t) = \nabla(\nabla^{j-1}(X_t))$, $j \geq 1$, with $\nabla^0(X_t) = X_t$. Polynomials in $B$ and $\nabla$ are
manipulated in precisely the same way as polynomial functions of real variables. For
example,

$$\begin{aligned}
\nabla^2 X_t = \nabla(\nabla(X_t)) &= (1 - B)(1 - B)X_t = (1 - 2B + B^2)X_t \\
&= X_t - 2X_{t-1} + X_{t-2}.
\end{aligned}$$

If the operator $\nabla$ is applied to a linear trend function $m_t = c_0 + c_1 t$, then we obtain the
constant function $\nabla m_t = m_t - m_{t-1} = c_0 + c_1 t - (c_0 + c_1(t - 1)) = c_1$. In the same
way any polynomial trend of degree $k$ can be reduced to a constant by application of
the operator $\nabla^k$ (Problem 1.10). For example, if $X_t = m_t + Y_t$, where $m_t = \sum_{j=0}^{k} c_j t^j$
and $Y_t$ is stationary with mean zero, application of $\nabla^k$ gives

$$\nabla^k X_t = k!\,c_k + \nabla^k Y_t,$$

a stationary process with mean $k!\,c_k$. These considerations suggest the possibility,
given any sequence $\{x_t\}$ of data, of applying the operator $\nabla$ repeatedly until we find
a sequence $\{\nabla^k x_t\}$ that can plausibly be modeled as a realization of a stationary
process. It is often found in practice that the order $k$ of differencing required is quite
small, frequently one or two. (This relies on the fact that many functions can be
well approximated, on an interval of finite length, by a polynomial of reasonably low
degree.)
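
The reduction of a degree-$k$ polynomial trend to the constant $k!\,c_k$ is easily confirmed on a small example. The sketch below uses our own function name and an arbitrary quadratic trend.

```python
from math import factorial

def diff(xs, k=1):
    """Apply the lag-1 difference operator k times."""
    for _ in range(k):
        xs = [xs[t] - xs[t - 1] for t in range(1, len(xs))]
    return list(xs)

# Quadratic trend m_t = 2 + 3t + 5t^2 for t = 1, ..., 10 (coefficients c_0, c_1, c_2).
c = [2, 3, 5]
m = [sum(cj * t ** j for j, cj in enumerate(c)) for t in range(1, 11)]

twice = diff(m, 2)      # every entry equals 2! * c_2 = 10
thrice = diff(m, 3)     # differencing once more gives zeros
```

Each application of the operator shortens the series by one observation, so $k$ differences of $n$ values leave $n - k$ values.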

Example 1.5.3  Applying the operator $\nabla$ to the population values $\{x_t, t = 1, \ldots, 20\}$ of Figure 1-5, we
find that two differencing operations are sufficient to produce a series with no apparent
trend. (To do the differencing in ITSM, select Transform>Difference, enter the
value 1 for the differencing lag, and click OK.) This replaces the original series $\{x_t\}$
by the once-differenced series $\{x_t - x_{t-1}\}$. Repetition of these steps gives the
twice-differenced series $\nabla^2 x_t = x_t - 2x_{t-1} + x_{t-2}$, plotted in Figure 1-23. Notice that the
magnitude of the fluctuations in $\nabla^2 x_t$ increases with the value of $x_t$. This effect can be
suppressed by first taking natural logarithms, $y_t = \ln x_t$, and then applying the operator
$\nabla^2$ to the series $\{y_t\}$. (See also Figures 1-1 and 1-17.)


□ 

Figure 1-23  The twice-differenced series derived from the population data of Figure 1-5


1.5.2  Estimation and Elimination of Both Trend and Seasonality

The  methods  described  for  the  estimation  and  elimination  of  trend  can  be  adapted  in 
a  natural  way  to  eliminate  both  trend  and  seasonality  in  the  general  model,  specified 
as  follows. 


Classical Decomposition Model

$$X_t = m_t + s_t + Y_t, \tag{1.5.11}$$

where $EY_t = 0$, $s_{t+d} = s_t$, and $\sum_{j=1}^{d} s_j = 0$.

We  shall  illustrate  these  methods  with  reference  to  the  accidental  deaths  data  of 
Example  1.1.3,  for  which  the  period  d  of  the  seasonal  component  is  clearly  12. 

1.5.2.1 Method S1: Estimation of Trend and Seasonal Components

The method we are about to describe is used in the Transform>Classical option
of ITSM.

Suppose we have observations $\{x_1, \ldots, x_n\}$. The trend is first estimated by applying
a moving average filter specially chosen to eliminate the seasonal component
and to dampen the noise. If the period $d$ is even, say $d = 2q$, then we use

$$\hat m_t = (0.5x_{t-q} + x_{t-q+1} + \cdots + x_{t+q-1} + 0.5x_{t+q})/d, \quad q < t \le n - q. \qquad (1.5.12)$$

If  the  period  is  odd,  say  d  =  2q  +  1,  then  we  use  the  simple  moving  average  (1.5.5). 

The second step is to estimate the seasonal component. For each $k = 1, \ldots, d$,
we compute the average $w_k$ of the deviations $\{(x_{k+jd} - \hat m_{k+jd}),\ q < k + jd \le n - q\}$.
Since these average deviations do not necessarily sum to zero, we estimate the seasonal
component $s_k$ as

$$\hat s_k = w_k - d^{-1}\sum_{i=1}^{d} w_i, \quad k = 1, \ldots, d, \qquad (1.5.13)$$

and $\hat s_k = \hat s_{k-d}$, $k > d$.
The deseasonalized data is then defined to be the original series with the estimated
seasonal component removed, i.e.,

$$d_t = x_t - \hat s_t, \quad t = 1, \ldots, n. \qquad (1.5.14)$$

Finally, we reestimate the trend from the deseasonalized data $\{d_t\}$ using one of
the methods already described. The program ITSM allows you to fit a least squares
polynomial trend $\hat m$ to the deseasonalized series. In terms of this reestimated trend and
the estimated seasonal component, the estimated noise series is then given by

$$\hat Y_t = x_t - \hat m_t - \hat s_t, \quad t = 1, \ldots, n.$$

The reestimation of the trend is done in order to have a parametric form for the trend
that can be extrapolated for the purposes of prediction and simulation.

Figure 1-24. The deseasonalized accidental deaths data from ITSM

Example 1.5.4

Figure 1-24 shows the deseasonalized accidental deaths data obtained from ITSM
by reading in the series DEATHS.TSM, selecting Transform>Classical, checking
only the box marked Seasonal Fit, entering 12 for the period, and clicking OK.
The estimated seasonal component $\hat s_t$, shown in Figure 1-25, is obtained by selecting
Transform>Show Classical Fit. (Except for having a mean of zero, this
estimate is very similar to the harmonic regression function with frequencies $2\pi/12$
and $2\pi/6$ displayed in Figure 1-11.) The graph of the deseasonalized data suggests
the presence of an additional quadratic trend function. In order to fit such a trend
to the deseasonalized data, select Transform>Undo Classical to retrieve the
original data and then select Transform>Classical and check the boxes marked
Seasonal Fit and Polynomial Trend, entering 12 for the period and selecting
Quadratic for the trend. Then click OK and you will obtain the trend function

$$\hat m_t = 9952 - 71.82t + 0.8260t^2, \quad 1 \le t \le 72.$$

At this point the data stored in ITSM consists of the estimated noise

$$\hat Y_t = x_t - \hat m_t - \hat s_t, \quad t = 1, \ldots, 72,$$

obtained by subtracting the estimated seasonal and trend components from the original
data.

□ 
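The first two steps of Method S1, equations (1.5.12) to (1.5.14), can be sketched as follows for an even period. The function name `classical_seasonal` and the synthetic monthly series are assumptions introduced for illustration; the DEATHS.TSM data are not used, and the final least squares refit of the trend is omitted. This is not ITSM's own code.

```python
import math

def classical_seasonal(x, d):
    """Method S1 for an even period d: moving-average trend, seasonal
    estimates centered to sum to zero, and the deseasonalized series."""
    n, q = len(x), d // 2
    # (1.5.12): defined for 1-based t with q < t <= n - q, i.e. 0-based q..n-q-1.
    m = [None] * n
    for t in range(q, n - q):
        total = 0.5 * x[t - q] + sum(x[t - q + 1:t + q]) + 0.5 * x[t + q]
        m[t] = total / d
    # Average deviation w_k for each position k in the seasonal cycle.
    w = []
    for k in range(d):
        devs = [x[t] - m[t] for t in range(k, n, d) if m[t] is not None]
        w.append(sum(devs) / len(devs))
    # (1.5.13): subtract the mean so that the seasonal estimates sum to zero.
    w_bar = sum(w) / d
    s = [wk - w_bar for wk in w]
    # (1.5.14): remove the estimated seasonal component.
    deseason = [x[t] - s[t % d] for t in range(n)]
    return m, s, deseason

# Hypothetical monthly data: linear trend plus a period-12 pattern summing to zero.
d = 12
s_true = [10.0 * math.cos(2 * math.pi * k / d) for k in range(d)]
x = [100.0 + 0.5 * (t + 1) + s_true[t % d] for t in range(72)]
m_hat, s_hat, deseason = classical_seasonal(x, d)
```

Because the artificial trend is linear and the seasonal pattern sums to zero over a period, the filter recovers the seasonal component exactly here; with real data the deseasonalized series would still contain noise and possibly a curved trend, as in Figure 1-24.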
Figure 1-25. The estimated seasonal component of the accidental deaths data from ITSM

1.5.2.2 Method S2: Elimination of Trend and Seasonal Components by Differencing

The technique of differencing that we applied earlier to nonseasonal data can be
adapted to deal with seasonality of period $d$ by introducing the lag-$d$ differencing
operator $\nabla_d$ defined by

$$\nabla_d X_t = X_t - X_{t-d} = (1 - B^d)X_t. \qquad (1.5.15)$$

(This operator should not be confused with the operator $\nabla^d = (1 - B)^d$ defined earlier.)
Applying the operator $\nabla_d$ to the model

$$X_t = m_t + s_t + Y_t,$$

where $\{s_t\}$ has period $d$, we obtain

$$\nabla_d X_t = m_t - m_{t-d} + Y_t - Y_{t-d},$$

which gives a decomposition of the difference $\nabla_d X_t$ into a trend component $(m_t - m_{t-d})$
and a noise term $(Y_t - Y_{t-d})$. The trend, $m_t - m_{t-d}$, can then be eliminated using the
methods already described, in particular by applying a power of the operator $\nabla$.

Example 1.5.5 Figure 1-26 shows the result of applying the operator $\nabla_{12}$ to the accidental deaths
data. The graph is obtained from ITSM by opening DEATHS.TSM, selecting
Transform>Difference, entering lag 12, and clicking OK. The seasonal component
evident in Figure 1-3 is absent from the graph of $\nabla_{12} x_t$, $13 \le t \le 72$. However,
there still appears to be a nondecreasing trend. If we now apply the operator $\nabla$ to
$\{\nabla_{12} x_t\}$ by again selecting Transform>Difference, this time with lag one, we
obtain the graph of $\nabla\nabla_{12} x_t$, $14 \le t \le 72$, shown in Figure 1-27, which has no apparent
trend or seasonal component. In Chapter 5 we show that this doubly differenced series
can in fact be well represented by a stationary time series model.

□ 
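The two differencing steps of Method S2 can be illustrated with a short sketch. The series below is synthetic (a quadratic trend plus an arbitrary period-12 pattern), and the helper `difference` is a name chosen here; neither comes from the text or from ITSM.

```python
def difference(x, lag=1):
    """Lag-`lag` differencing: returns x[t] - x[t - lag] for t >= lag."""
    return [x[t] - x[t - lag] for t in range(lag, len(x))]

# Hypothetical monthly series of length 72 (not the accidental deaths data).
season = [5, 3, 0, -2, -4, -5, -4, -2, 0, 2, 3, 4]
x = [50 + 2 * t + 0.1 * t * t + season[(t - 1) % 12] for t in range(1, 73)]

d12 = difference(x, lag=12)       # lag-12 difference, t = 13..72: seasonal part removed
d1_d12 = difference(d12, lag=1)   # then lag-1 difference, t = 14..72: trend removed
```

In this noise-free example the lag-12 difference is a linear function of t and the subsequent lag-1 difference is constant, mirroring the sequence of operations that leads from Figure 1-26 to Figure 1-27.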

Figure 1-26. The differenced series $\{\nabla_{12} x_t, t = 13, \ldots, 72\}$ derived from the monthly accidental deaths $\{x_t, t = 1, \ldots, 72\}$

Figure 1-27. The differenced series $\{\nabla\nabla_{12} x_t, t = 14, \ldots, 72\}$ derived from the monthly accidental deaths $\{x_t, t = 1, \ldots, 72\}$

In this section we have discussed a variety of methods for estimating and/or
removing trend and seasonality. The particular method chosen for any given data
set will depend on a number of factors including whether or not estimates of the
components of the series are required and whether or not it appears that the data contain
a seasonal component that does not vary with time. The program ITSM allows two
options under the Transform menu:

1. “classical decomposition,” in which trend and/or seasonal components are estimated
and subtracted from the data to generate a noise sequence, and

2.  “differencing,”  in  which  trend  and/or  seasonal  components  are  removed  from  the 
data  by  repeated  differencing  at  one  or  more  lags  in  order  to  generate  a  noise 
sequence. 

A  third  option  is  to  use  the  Regression  menu,  possibly  after  applying  a  Box-Cox 
transformation.  Using  this  option  we  can  (see  Example  1.3.6) 

3.  fit  a  sum  of  harmonics  and  a  polynomial  trend  to  generate  a  noise  sequence  that 
consists  of  the  residuals  from  the  regression. 
In the next section we shall examine some techniques for deciding whether or not the
noise  sequence  so  generated  differs  significantly  from  iid  noise.  If  the  noise  sequence 
does  have  sample  autocorrelations  significantly  different  from  zero,  then  we  can  take 
advantage  of  this  serial  dependence  to  forecast  future  noise  values  in  terms  of  past 
values  by  modeling  the  noise  as  a  stationary  time  series. 


1.6 Testing the Estimated Noise Sequence

The  objective  of  the  data  transformations  described  in  Section  1.5  is  to  produce  a 
series  with  no  apparent  deviations  from  stationarity,  and  in  particular  with  no  apparent 
trend  or  seasonality.  Assuming  that  this  has  been  done,  the  next  step  is  to  model  the 
estimated  noise  sequence  (i.e.,  the  residuals  obtained  either  by  differencing  the  data 
or  by  estimating  and  subtracting  the  trend  and  seasonal  components).  If  there  is  no 
dependence among these residuals, then we can regard them as observations
of  independent  random  variables,  and  there  is  no  further  modeling  to  be  done  except  to 
estimate  their  mean  and  variance.  However,  if  there  is  significant  dependence  among 
the  residuals,  then  we  need  to  look  for  a  more  complex  stationary  time  series  model 
for  the  noise  that  accounts  for  the  dependence.  This  will  be  to  our  advantage,  since 
dependence  means  in  particular  that  past  observations  of  the  noise  sequence  can  assist 
in  predicting  future  values. 

In  this  section  we  examine  some  simple  tests  for  checking  the  hypothesis  that 
the  residuals  from  Section  1.5  are  observed  values  of  independent  and  identically 
distributed  random  variables.  If  they  are,  then  our  work  is  done.  If  not,  then  we  must 
use  the  theory  of  stationary  processes  to  be  developed  in  later  chapters  to  find  a  more 
appropriate  model. 

(a) The sample autocorrelation function. For large $n$, the sample autocorrelations
of an iid sequence $Y_1, \ldots, Y_n$ with finite variance are approximately iid with
distribution $N(0, 1/n)$ (see Brockwell and Davis (1991) p. 222). Hence, if $y_1, \ldots, y_n$
is a realization of such an iid sequence, about 95% of the sample autocorrelations
should fall between the bounds $\pm 1.96/\sqrt{n}$. If we compute the sample autocorrelations
up to lag 40 and find that more than two or three values fall outside the bounds, or
that one value falls far outside the bounds, we therefore reject the iid hypothesis. The
bounds $\pm 1.96/\sqrt{n}$ are automatically plotted when the sample autocorrelation function
is computed by the program ITSM.
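The check in (a) can be sketched as below. The function `sample_acf` follows the usual sample autocovariance with divisor $n$; the simulated Gaussian series and the variable names are assumptions made for illustration and are unrelated to any ITSM data file.

```python
import math
import random

def sample_acf(x, max_lag):
    """Sample autocorrelations at lags 1..max_lag (autocovariances use divisor n)."""
    n = len(x)
    mean = sum(x) / n

    def gamma(h):
        return sum((x[t + h] - mean) * (x[t] - mean) for t in range(n - h)) / n

    g0 = gamma(0)
    return [gamma(h) / g0 for h in range(1, max_lag + 1)]

# Hypothetical iid N(0, 1) sample of length 200.
random.seed(0)
y = [random.gauss(0.0, 1.0) for _ in range(200)]

rho = sample_acf(y, 40)
bound = 1.96 / math.sqrt(len(y))
# Lags whose sample autocorrelation falls outside the +-1.96/sqrt(n) bounds.
outside = [h for h, r in enumerate(rho, start=1) if abs(r) > bound]
```

For a genuinely iid series one expects roughly 5% of the 40 lags, that is about two, to appear in `outside`; many more than that, or one value far outside the bounds, would count against the iid hypothesis.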

(b) The portmanteau test. Instead of checking to see whether each sample
autocorrelation $\hat\rho(j)$ falls inside the bounds defined in (a) above, it is also possible
to consider the single statistic

$$Q = n\sum_{j=1}^{h} \hat\rho^2(j).$$

If $Y_1, \ldots, Y_n$ is a finite-variance iid sequence, then by the same result used in (a), $Q$
is approximately distributed as the sum of squares of the independent $N(0, 1)$ random
variables $\sqrt{n}\hat\rho(j)$, $j = 1, \ldots, h$, i.e., as chi-squared with $h$ degrees of freedom. A
large value of $Q$ suggests that the sample autocorrelations of the data are too large for
the data to be a sample from an iid sequence. We therefore reject the iid hypothesis
at level $\alpha$ if $Q > \chi^2_{1-\alpha}(h)$, where $\chi^2_{1-\alpha}(h)$ is the $1 - \alpha$ quantile of the chi-squared
distribution with $h$ degrees of freedom. The program ITSM conducts a refinement of
this test, formulated by Ljung and Box (1978), in which $Q$ is replaced by

$$Q_{LB} = n(n + 2)\sum_{j=1}^{h} \hat\rho^2(j)/(n - j),$$

whose distribution is better approximated by the chi-squared distribution with $h$
degrees of freedom.

Another  portmanteau  test,  formulated  by  McLeod  and  Li  (1983),  can  be  used  as 
a  further  test  for  the  iid  hypothesis,  since  if  the  data  are  iid,  then  the  squared  data  are 
also  iid.  It  is  based  on  the  same  statistic  used  for  the  Ljung-Box  test,  except  that  the 
sample  autocorrelations  of  the  data  are  replaced  by  the  sample  autocorrelations  of  the 
squared data, $\hat\rho_{WW}(h)$, giving

$$Q_{ML} = n(n + 2)\sum_{k=1}^{h} \hat\rho^2_{WW}(k)/(n - k).$$

The hypothesis of iid data is then rejected at level $\alpha$ if the observed value of $Q_{ML}$ is
larger than the $1 - \alpha$ quantile of the $\chi^2(h)$ distribution.
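The three portmanteau statistics above can be computed directly from the sample autocorrelations. The sketch below uses function names chosen here (`box_pierce_q`, `ljung_box`, `mcleod_li`) and, to stay within the standard library, compares against the tabulated 0.95 quantile of the chi-squared distribution with 20 degrees of freedom (31.41) rather than computing quantiles. The test series is a deliberately non-iid linear trend, an assumption made purely for illustration.

```python
def sample_acf(x, max_lag):
    """Sample autocorrelations at lags 1..max_lag (autocovariances use divisor n)."""
    n = len(x)
    mean = sum(x) / n

    def gamma(h):
        return sum((x[t + h] - mean) * (x[t] - mean) for t in range(n - h)) / n

    g0 = gamma(0)
    return [gamma(h) / g0 for h in range(1, max_lag + 1)]

def box_pierce_q(x, h):
    """Q = n * sum of squared sample autocorrelations at lags 1..h."""
    return len(x) * sum(r * r for r in sample_acf(x, h))

def ljung_box(x, h):
    """Q_LB = n(n+2) * sum over j of rho_hat(j)^2 / (n - j)."""
    n = len(x)
    rho = sample_acf(x, h)
    return n * (n + 2) * sum(r * r / (n - j) for j, r in enumerate(rho, start=1))

def mcleod_li(x, h):
    """Q_ML: the Ljung-Box statistic applied to the squared data."""
    return ljung_box([v * v for v in x], h)

CHI2_95_DF20 = 31.41  # tabulated 0.95 quantile of chi-squared with 20 d.f.

trend = [float(t) for t in range(200)]   # hypothetical, clearly not iid
q_lb = ljung_box(trend, 20)
reject_iid = q_lb > CHI2_95_DF20
```

In practice one would obtain the quantile or p-value for an arbitrary $h$ from a statistics library; the hard-coded constant only covers the case $h = 20$ at level 0.05.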

(c) The turning point test. If $y_1, \ldots, y_n$ is a sequence of observations, we say
that there is a turning point at time $i$, $1 < i < n$, if $y_{i-1} < y_i$ and $y_i > y_{i+1}$ or if
$y_{i-1} > y_i$ and $y_i < y_{i+1}$. If $T$ is the number of turning points of an iid sequence of
length $n$, then, since the probability of a turning point at time $i$ is $\frac{2}{3}$, the expected value
of $T$ is

$$\mu_T = E(T) = 2(n - 2)/3.$$

It can also be shown for an iid sequence that the variance of $T$ is

$$\sigma_T^2 = \mathrm{Var}(T) = (16n - 29)/90.$$

A large value of $T - \mu_T$ indicates that the series is fluctuating more rapidly than
expected for an iid sequence. On the other hand, a value of $T - \mu_T$ much smaller
than zero indicates a positive correlation between neighboring observations. For an iid
sequence with $n$ large, it can be shown that

$$T \text{ is approximately } N(\mu_T, \sigma_T^2).$$

This means we can carry out a test of the iid hypothesis, rejecting it at level $\alpha$ if
$|T - \mu_T|/\sigma_T > \Phi_{1-\alpha/2}$, where $\Phi_{1-\alpha/2}$ is the $1 - \alpha/2$ quantile of the standard normal
distribution. (A commonly used value of $\alpha$ is 0.05, for which the corresponding value
of $\Phi_{1-\alpha/2}$ is 1.96.)
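A minimal sketch of the turning point test follows. The function name and return convention are choices made here, and the two input series are artificial.

```python
import math
from statistics import NormalDist

def turning_point_test(y, alpha=0.05):
    """Return (T, standardized |T - mu_T| / sigma_T, reject_iid)."""
    n = len(y)
    # Count interior times i at which y has a strict local maximum or minimum.
    T = sum(
        1
        for i in range(1, n - 1)
        if (y[i - 1] < y[i] > y[i + 1]) or (y[i - 1] > y[i] < y[i + 1])
    )
    mu = 2 * (n - 2) / 3
    var = (16 * n - 29) / 90
    z = abs(T - mu) / math.sqrt(var)
    crit = NormalDist().inv_cdf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    return T, z, z > crit

# A monotone series has no turning points, far fewer than expected under iid.
T_mono, z_mono, reject_mono = turning_point_test(list(range(50)))

# A short zigzag series turns at every interior point.
T_zig, z_zig, reject_zig = turning_point_test([1, 3, 2, 4, 3, 5, 4, 6])
```

The normal approximation is intended for large $n$, so the result for the eight-point zigzag series is shown only to illustrate the counting rule.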

(d) The difference-sign test. For this test we count the number $S$ of values of $i$
such that $y_i > y_{i-1}$, $i = 2, \ldots, n$, or equivalently the number of times the differenced
series $y_i - y_{i-1}$ is positive. For an iid sequence it is clear that

$$\mu_S = ES = \tfrac{1}{2}(n - 1).$$

It can also be shown, under the same assumption, that

$$\sigma_S^2 = \mathrm{Var}(S) = (n + 1)/12,$$

and that for large $n$,

$$S \text{ is approximately } N(\mu_S, \sigma_S^2).$$

A large positive (or negative) value of $S - \mu_S$ indicates the presence of an increasing
(or decreasing) trend in the data. We therefore reject the assumption of no trend in the
data if $|S - \mu_S|/\sigma_S > \Phi_{1-\alpha/2}$.

The difference-sign test must be used with caution. A set of observations exhibiting
a strong cyclic component will pass the difference-sign test for randomness, since
roughly half of the observations will be points of increase.
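The difference-sign test, and the caution about cyclic data, can be illustrated as follows. The function name and both example series are assumptions introduced here.

```python
import math
from statistics import NormalDist

def difference_sign_test(y, alpha=0.05):
    """Return (S, standardized |S - mu_S| / sigma_S, reject_no_trend)."""
    n = len(y)
    S = sum(1 for i in range(1, n) if y[i] > y[i - 1])   # points of increase
    mu = (n - 1) / 2
    var = (n + 1) / 12
    z = abs(S - mu) / math.sqrt(var)
    crit = NormalDist().inv_cdf(1 - alpha / 2)
    return S, z, z > crit

# An increasing series: every difference is positive, so the test rejects.
S_up, z_up, reject_up = difference_sign_test(list(range(50)))

# A purely cyclic series (period 12): about half the differences are positive,
# so the test does not reject even though the data are clearly not iid.
cyc = [math.sin(2 * math.pi * t / 12) for t in range(120)]
S_cyc, z_cyc, reject_cyc = difference_sign_test(cyc)
```

The second call shows why this test is best read alongside the others in this section rather than on its own.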

(e) The rank test. The rank test is particularly useful for detecting a linear trend
in the data. Define $P$ to be the number of pairs $(i, j)$ such that $y_j > y_i$ and $j > i$,
$i = 1, \ldots, n - 1$. There is a total of $\binom{n}{2} = \frac{1}{2}n(n - 1)$ pairs $(i, j)$ such that $j > i$. For
an iid sequence $\{Y_1, \ldots, Y_n\}$, each event $\{Y_j > Y_i\}$ has probability $\frac{1}{2}$, and the mean
of $P$ is therefore

$$\mu_P = \tfrac{1}{4}n(n - 1).$$

It can also be shown for an iid sequence that the variance of $P$ is

$$\sigma_P^2 = n(n - 1)(2n + 5)/72$$

and that for large $n$,

$$P \text{ is approximately } N(\mu_P, \sigma_P^2)$$

(see Kendall and Stuart 1976). A large positive (negative) value of $P - \mu_P$ indicates the
presence of an increasing (decreasing) trend in the data. The assumption that $\{y_j\}$ is a
sample from an iid sequence is therefore rejected at level $\alpha = 0.05$ if $|P - \mu_P|/\sigma_P >
\Phi_{1-\alpha/2} = 1.96$.
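A direct, quadratic-time sketch of the rank test is given below. The function names are chosen for illustration; `rank_moments` simply evaluates the mean and variance formulas above.

```python
import math
from statistics import NormalDist

def rank_moments(n):
    """Mean and variance of P under the iid hypothesis."""
    return n * (n - 1) / 4, n * (n - 1) * (2 * n + 5) / 72

def rank_test(y, alpha=0.05):
    """Return (P, standardized |P - mu_P| / sigma_P, reject_iid)."""
    n = len(y)
    # Count pairs (i, j) with j > i and y[j] > y[i].
    P = sum(1 for i in range(n - 1) for j in range(i + 1, n) if y[j] > y[i])
    mu, var = rank_moments(n)
    z = abs(P - mu) / math.sqrt(var)
    crit = NormalDist().inv_cdf(1 - alpha / 2)
    return P, z, z > crit

# A strictly increasing series: every pair is concordant, so P = n(n-1)/2.
P_up, z_up, reject_up = rank_test(list(range(30)))

mu_200, var_200 = rank_moments(200)   # moments for a sample of size 200
```

For long series the pair count can be obtained more efficiently with a merge-sort style inversion count, but the double loop keeps the definition of $P$ visible.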

(f) Fitting an autoregressive model. A further test that can be carried out using
the  program  ITSM  is  to  fit  an  autoregressive  model  to  the  data  using  the  Yule-Walker 
algorithm  (discussed  in  Section  5.1.1)  and  choosing  the  order  which  minimizes  the 
AICC  statistic  (see  Section  5.5).  A  selected  order  equal  to  zero  suggests  that  the  data 
is  white  noise. 

(g)  Checking  for  normality.  If  the  noise  process  is  Gaussian,  i.e.,  if  all  of  its 
joint  distributions  are  normal,  then  stronger  conclusions  can  be  drawn  when  a  model 
is  fitted  to  the  data.  The  following  test  enables  us  to  check  whether  it  is  reasonable 
to  assume  that  observations  from  an  iid  sequence  are  also  Gaussian. 

Let $Y_{(1)} < Y_{(2)} < \cdots < Y_{(n)}$ be the order statistics of a random sample $Y_1, \ldots, Y_n$
from the distribution $N(\mu, \sigma^2)$. If $X_{(1)} < X_{(2)} < \cdots < X_{(n)}$ are the order statistics
from a $N(0, 1)$ sample of size $n$, then

$$EY_{(j)} = \mu + \sigma m_j,$$

where $m_j = EX_{(j)}$, $j = 1, \ldots, n$. The graph of the points $(m_1, Y_{(1)}), \ldots, (m_n, Y_{(n)})$
is called a Gaussian qq plot and can be displayed in ITSM by clicking on the yellow
button labeled QQ. If the normal assumption is correct, the Gaussian qq plot should
be approximately linear. Consequently, the squared correlation of the points $(m_i, Y_{(i)})$,
$i = 1, \ldots, n$, should be near 1. The assumption of normality is therefore rejected if the
squared correlation $R^2$ is sufficiently small. If we approximate $m_i$ by $\Phi^{-1}((i - 0.5)/n)$
(see Mage 1982 for some alternative approximations), then $R^2$ reduces to

$$R^2 = \frac{\left(\sum_{i=1}^{n} (Y_{(i)} - \bar Y)\,\Phi^{-1}\!\left(\frac{i - 0.5}{n}\right)\right)^2}{\sum_{i=1}^{n} (Y_{(i)} - \bar Y)^2 \,\sum_{i=1}^{n} \left(\Phi^{-1}\!\left(\frac{i - 0.5}{n}\right)\right)^2},$$

where $\bar Y = n^{-1}(Y_1 + \cdots + Y_n)$. Percentage points for the distribution of $R^2$, assuming
normality of the sample values, are given by Shapiro and Francia (1972) for sample
sizes $n < 100$. For $n = 200$, $P(R^2 < 0.987) = 0.05$ and $P(R^2 < 0.989) = 0.10$. For
larger values of $n$ the Jarque-Bera test (Jarque and Bera, 1980) for normality can be
used (see Section 5.3.3).

Figure 1-28. The sample autocorrelation function for the data of Example 1.1.4 showing the bounds $\pm 1.96/\sqrt{n}$

Example 1.6.1

If  we  did  not  know  in  advance  how  the  signal  plus  noise  data  of  Example  1.1.4  were 
generated,  we  might  suspect  that  they  came  from  an  iid  sequence.  We  can  check  this 
hypothesis  with  the  aid  of  the  tests  (a)-(f)  introduced  above. 

(a)  The  sample  autocorrelation  function  (Figure  1-28)  is  obtained  from  ITSM  by 
opening  the  project  SIGNAL.TSM  and  clicking  on  the  second  yellow  button  at  the 
top of the ITSM window. Observing that 25% of the autocorrelations are
outside the bounds $\pm 1.96/\sqrt{200}$, we reject the hypothesis that the series is iid.

The  remaining  tests  (b),  (c),  (d),  (e),  and  (f)  are  performed  by  choosing  the 
option  Statistics>Residual  Analysis>Tests  of  Randomness. 
(Since  no  model  has  been  fitted  to  the  data,  the  residuals  are  the  same  as  the  data 
themselves.) 

(b) The sample value of the Ljung-Box statistic $Q_{LB}$ with $h = 20$ is 51.84. Since the
corresponding p-value (displayed by ITSM) is $0.00012 < 0.05$, we reject the iid
hypothesis at level 0.05. The p-value for the McLeod-Li statistic $Q_{ML}$ is 0.717. The
McLeod-Li statistic therefore does not provide sufficient evidence to reject the iid
hypothesis at level 0.05.

(c) The sample value of the turning-point statistic $T$ is 138, and the asymptotic
distribution under the iid hypothesis (with sample size $n = 200$) is $N(132, 35.3)$. Thus
$|T - \mu_T|/\sigma_T = 1.01$, corresponding to a computed p-value of 0.312. On the basis of
the value of $T$ there is therefore not sufficient evidence to reject the iid hypothesis
at level 0.05.

(d) The sample value of the difference-sign statistic $S$ is 101, and the asymptotic
distribution under the iid hypothesis (with sample size $n = 200$) is $N(99.5, 16.7)$.
Thus $|S - \mu_S|/\sigma_S = 0.38$, corresponding to a computed p-value of 0.714. On the
basis of the value of $S$ there is therefore not sufficient evidence to reject the iid
hypothesis at level 0.05.

(e) The sample value of the rank statistic $P$ is 10,310, and the asymptotic distribution
under the iid hypothesis (with $n = 200$) is $N(9950, 2.239 \times 10^5)$. The statistic
$|P - \mu_P|/\sigma_P$ is therefore equal to 0.76, corresponding to a p-value of 0.447. On
the basis of the value of $P$ there is therefore not sufficient evidence to reject the
iid hypothesis at level 0.05.

(f)  The  minimum-AICC  Yule-Walker  autoregressive  model  for  the  data  is  of 
order  seven,  supporting  the  evidence  provided  by  the  sample  ACF  and  Ljung-Box 
tests  against  the  iid  hypothesis. 

Thus,  although  not  all  of  the  tests  detect  significant  deviation  from  iid  behavior,  the 
sample  autocorrelation,  the  Ljung-Box  statistic,  and  the  fitted  autoregression  provide 
strong  evidence  against  it,  causing  us  to  reject  it  (correctly)  in  this  example. 

□ 
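The p-values quoted in parts (c), (d), and (e) of Example 1.6.1 can be checked from the reported statistics alone, using the mean and variance formulas of this section with $n = 200$. The sketch below is an arithmetic check only; the dictionary keys and variable names are chosen here, and the observed values 138, 101, and 10,310 are taken from the example rather than recomputed from SIGNAL.TSM.

```python
import math
from statistics import NormalDist

n = 200
# name: (observed statistic, mean under iid, variance under iid)
checks = {
    "turning point T": (138, 2 * (n - 2) / 3, (16 * n - 29) / 90),
    "difference-sign S": (101, (n - 1) / 2, (n + 1) / 12),
    "rank P": (10310, n * (n - 1) / 4, n * (n - 1) * (2 * n + 5) / 72),
}

results = {}
for name, (observed, mu, var) in checks.items():
    z = abs(observed - mu) / math.sqrt(var)
    p_value = 2 * (1 - NormalDist().cdf(z))   # two-sided normal p-value
    results[name] = (mu, var, z, p_value)
```

The two-sided p-values obtained this way agree with the quoted 0.312, 0.714, and 0.447 to three decimals; the standardized statistics may differ slightly in the second decimal depending on how the variances are rounded.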

The  general  strategy  in  applying  the  tests  described  in  this  section  is  to  check  them 
all  and  to  proceed  with  caution  if  any  of  them  suggests  a  serious  deviation  from  the  iid 
hypothesis.  (Remember  that  as  you  increase  the  number  of  tests,  the  probability  that 
at  least  one  rejects  the  null  hypothesis  when  it  is  true  increases.  You  should  therefore 
not  necessarily  reject  the  null  hypothesis  on  the  basis  of  one  test  result  only.) 
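The squared correlation $R^2$ of the Gaussian qq plot described in (g) can be computed as sketched below, using the approximation $m_i \approx \Phi^{-1}((i - 0.5)/n)$. The function name and the two artificial samples are assumptions introduced for illustration; the threshold 0.987 is the 5% point quoted in (g) for $n = 200$.

```python
import math
from statistics import NormalDist, mean

def qq_squared_correlation(y):
    """Squared correlation between the ordered data and approximate normal scores."""
    n = len(y)
    ordered = sorted(y)
    y_bar = mean(ordered)
    scores = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    num = sum((v - y_bar) * m for v, m in zip(ordered, scores)) ** 2
    den = sum((v - y_bar) ** 2 for v in ordered) * sum(m * m for m in scores)
    return num / den

n = 200
scores = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

# Hypothetical "ideal" normal sample: an affine function of the normal scores.
ideal = [3.0 + 2.0 * m for m in scores]
r2_ideal = qq_squared_correlation(ideal)

# Hypothetical strongly skewed sample: exponentiated normal scores.
skewed = [math.exp(m) for m in scores]
r2_skewed = qq_squared_correlation(skewed)
reject_normality = r2_skewed < 0.987   # 5% point for n = 200 quoted in (g)
```

The ideal sample gives $R^2$ equal to 1 up to rounding error, while the skewed sample falls well below the quoted 5% point.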


Problems 


1.1 Let $X$ and $Y$ be two random variables with $E(Y) = \mu$ and $EY^2 < \infty$.

a. Show that the constant $c$ that minimizes $E(Y - c)^2$ is $c = \mu$.

b. Deduce that the random variable $f(X)$ that minimizes $E[(Y - f(X))^2 \mid X]$ is

$$f(X) = E[Y \mid X].$$

c. Deduce that the random variable $f(X)$ that minimizes $E(Y - f(X))^2$ is also

$$f(X) = E[Y \mid X].$$

1.2 (Generalization of Problem 1.1.) Suppose that $X_1, X_2, \ldots$ is a sequence of random
variables with $E(X_t^2) < \infty$ and $E(X_t) = \mu$.

a. Show that the random variable $f(X_1, \ldots, X_n)$ that minimizes the conditional
mean squared error, $E[(X_{n+1} - f(X_1, \ldots, X_n))^2 \mid X_1, \ldots, X_n]$, is

$$f(X_1, \ldots, X_n) = E[X_{n+1} \mid X_1, \ldots, X_n].$$

b. Deduce that the random variable $f(X_1, \ldots, X_n)$ that minimizes the unconditional
mean squared error, $E[(X_{n+1} - f(X_1, \ldots, X_n))^2]$, is also

$$f(X_1, \ldots, X_n) = E[X_{n+1} \mid X_1, \ldots, X_n].$$

c. If $X_1, X_2, \ldots$ is iid with $E(X_t^2) < \infty$ and $EX_t = \mu$, where $\mu$ is known, what
is the minimum mean squared error predictor of $X_{n+1}$ in terms of $X_1, \ldots, X_n$?

d. Under the conditions of part (c) show that the best linear unbiased estimator
of $\mu$ in terms of $X_1, \ldots, X_n$ is $\bar X = \frac{1}{n}(X_1 + \cdots + X_n)$. ($\hat\mu$ is said to be an unbiased
estimator of $\mu$ if $E\hat\mu = \mu$ for all $\mu$.)

e. Under the conditions of part (c) show that $\bar X$ is the best linear predictor of
$X_{n+1}$ that is unbiased for $\mu$.

f. If $X_1, X_2, \ldots$ is iid with $E(X_t^2) < \infty$ and $EX_t = \mu$, and if $S_0 = 0$, $S_n =
X_1 + \cdots + X_n$, $n = 1, 2, \ldots$, what is the minimum mean squared error predictor
of $S_{n+1}$ in terms of $S_1, \ldots, S_n$?
1.3 Show that a strictly stationary process with $E(X_t^2) < \infty$ is weakly stationary.

1.4 Let $\{Z_t\}$ be a sequence of independent normal random variables, each with
mean 0 and variance $\sigma^2$, and let $a$, $b$, and $c$ be constants. Which, if any, of the
following processes are stationary? For each stationary process specify the mean
and autocovariance function.

a. $X_t = a + bZ_t + cZ_{t-2}$

b. $X_t = Z_1\cos(ct) + Z_2\sin(ct)$

c. $X_t = Z_t\cos(ct) + Z_{t-1}\sin(ct)$

d. $X_t = a + bZ_0$

e. $X_t = Z_0\cos(ct)$

f. $X_t = Z_tZ_{t-1}$


1.5 Let $\{X_t\}$ be the moving-average process of order 2 given by

$$X_t = Z_t + \theta Z_{t-2},$$

where $\{Z_t\}$ is WN(0, 1).

a. Find the autocovariance and autocorrelation functions for this process when
$\theta = 0.8$.

b. Compute the variance of the sample mean $(X_1 + X_2 + X_3 + X_4)/4$ when $\theta = 0.8$.

c. Repeat (b) when $\theta = -0.8$ and compare your answer with the result obtained
in (b).


1.6 Let $\{X_t\}$ be the AR(1) process defined in Example 1.4.5.

a. Compute the variance of the sample mean $(X_1 + X_2 + X_3 + X_4)/4$ when $\phi = 0.9$
and $\sigma^2 = 1$.

b. Repeat (a) when $\phi = -0.9$ and compare your answer with the result obtained
in (a).


1.7 If $\{X_t\}$ and $\{Y_t\}$ are uncorrelated stationary sequences, i.e., if $X_r$ and $Y_s$ are
uncorrelated for every $r$ and $s$, show that $\{X_t + Y_t\}$ is stationary with autocovariance
function equal to the sum of the autocovariance functions of $\{X_t\}$ and $\{Y_t\}$.


1.8 Let $\{Z_t\}$ be IID N(0, 1) noise and define

$$X_t = \begin{cases} Z_t, & \text{if } t \text{ is even,} \\ (Z_{t-1}^2 - 1)/\sqrt{2}, & \text{if } t \text{ is odd.} \end{cases}$$

a. Show that $\{X_t\}$ is WN(0, 1) but not iid(0, 1) noise.

b. Find $E(X_{n+1} \mid X_1, \ldots, X_n)$ for $n$ odd and $n$ even and compare the results.

1.9 Let $\{x_1, \ldots, x_n\}$ be observed values of a time series at times $1, \ldots, n$, and let
$\hat\rho(h)$ be the sample ACF at lag $h$ as in Definition 1.4.4.

a. If $x_t = a + bt$, where $a$ and $b$ are constants and $b \ne 0$, show that for each
fixed $h \ge 1$,

$$\hat\rho(h) \to 1 \text{ as } n \to \infty.$$

b. If $x_t = c\cos(\omega t)$, where $c$ and $\omega$ are constants ($c \ne 0$ and $\omega \in (-\pi, \pi]$),
show that for each fixed $h$,

$$\hat\rho(h) \to \cos(\omega h) \text{ as } n \to \infty.$$
1.10 If $m_t = \sum_{k=0}^{p} c_k t^k$, $t = 0, \pm 1, \ldots$, show that $\nabla m_t$ is a polynomial of degree $p - 1$
in $t$ and hence that $\nabla^{p+1} m_t = 0$.

1.11 Consider the simple moving-average filter with weights $a_j = (2q + 1)^{-1}$, $-q \le j \le q$.

a. If $m_t = c_0 + c_1 t$, show that $\sum_{j=-q}^{q} a_j m_{t-j} = m_t$.

b. If $Z_t$, $t = 0, \pm 1, \pm 2, \ldots$, are independent random variables with mean 0
and variance $\sigma^2$, show that the moving average $A_t = \sum_{j=-q}^{q} a_j Z_{t-j}$ is “small”
for large $q$ in the sense that $EA_t = 0$ and $\mathrm{Var}(A_t) = \sigma^2/(2q + 1)$.


1.12 a. Show that a linear filter $\{a_j\}$ passes an arbitrary polynomial of degree $k$ without
distortion, i.e., that

$$m_t = \sum_j a_j m_{t-j}$$

for all $k$th-degree polynomials $m_t = c_0 + c_1 t + \cdots + c_k t^k$, if and only if

$$\sum_j a_j = 1$$

and

$$\sum_j j^r a_j = 0, \quad \text{for } r = 1, \ldots, k.$$

b. Deduce that the Spencer 15-point moving-average filter $\{a_j\}$ defined by (1.5.6)
passes arbitrary third-degree polynomial trends without distortion.

1.13 Find a filter of the form $1 + \alpha B + \beta B^2 + \gamma B^3$ (i.e., find $\alpha$, $\beta$, and $\gamma$) that
passes linear trends without distortion and that eliminates arbitrary seasonal
components of period 2.

1.14 Show that the filter with coefficients $[a_{-2}, a_{-1}, a_0, a_1, a_2] = \frac{1}{9}[-1, 4, 3, 4, -1]$
passes third-degree polynomials and eliminates seasonal components with period 3.


1.15 Let $\{Y_t\}$ be a stationary process with mean zero and let $a$ and $b$ be constants.

a. If $X_t = a + bt + s_t + Y_t$, where $s_t$ is a seasonal component with period 12, show
that $\nabla\nabla_{12} X_t = (1 - B)(1 - B^{12})X_t$ is stationary and express its autocovariance
function in terms of that of $\{Y_t\}$.

b. If $X_t = (a + bt)s_t + Y_t$, where $s_t$ is a seasonal component with period 12,
show that $\nabla_{12}^2 X_t = (1 - B^{12})^2 X_t$ is stationary and express its autocovariance
function in terms of that of $\{Y_t\}$.

1.16 (Using ITSM to smooth the strikes data.) Double-click on the ITSM icon,
select File>Project>Open>Univariate, click OK, and open the file
STRIKES.TSM. The graph of the data will then appear on your screen. For
smoothing select either Smooth>Moving Ave, Smooth>Exponential, or
Smooth>FFT. Try using each of these to reproduce the results shown in
Figures 1-18, 1-21, and 1-22.

1.17 (Using ITSM to plot the deaths data.) In ITSM select File>Project>Open>
Univariate, click OK, and open the project DEATHS.TSM. The graph of
the data will then appear on your screen. To see a histogram of the data, click
on the sixth yellow button at the top of the ITSM window. To see the sample
autocorrelation function, click on the second yellow button. The presence of a
strong seasonal component with period 12 is evident in the graph of the data and
in the sample autocorrelation function.
1.18 (Using ITSM to analyze the deaths data.) Open the file DEATHS.TSM, select
Transform>Classical, check the box marked Seasonal Fit, and enter
12 for the period. Make sure that the box labeled Polynomial Fit is not
checked, and click OK. You will then see the graph (Figure 1-24) of the
deseasonalized data. This graph suggests the presence of an additional quadratic
trend function. To fit such a trend, select Transform>Undo Classical to
retrieve the original data. Then select Transform>Classical and check the
boxes marked Seasonal Fit and Polynomial Trend, entering 12 for the
period and Quadratic for the trend. Click OK to obtain the trend function

$$\hat m_t = 9952 - 71.82t + 0.8260t^2, \quad 1 \le t \le 72.$$

At this point the data stored in ITSM consists of the estimated noise

$$\hat Y_t = x_t - \hat m_t - \hat s_t, \quad t = 1, \ldots, 72,$$

obtained by subtracting the estimated seasonal and trend components
from the original data. The sample autocorrelation function can be plotted
by clicking on the second yellow button at the top of the ITSM window.
Further tests for dependence can be carried out by selecting the options
Statistics>Residual Analysis>Tests of Randomness. These
show clearly the substantial dependence in the series $\{\hat Y_t\}$.

To forecast the data without allowing for this dependence, select the
option Forecasting>ARMA. Specify 24 for the number of values to be
forecast, and the program will compute forecasts based on the assumption
that the estimated seasonal and trend components are true values and that $\{\hat Y_t\}$
is a white noise sequence with zero mean. (This is the default model assumed
by ITSM until a more complicated stationary model is estimated or specified.)
The original data are plotted with the forecasts appended. Later we shall see
how to improve on these forecasts by taking into account the dependence in the
series $\{\hat Y_t\}$.

1.19 Use a text editor to construct and save a text file named TEST.TSM, which
consists of a single column of 30 numbers, $\{x_1, \ldots, x_{30}\}$, defined by

$x_1, \ldots, x_{10}$: 486, 474, 434, 441, 435, 401, 414, 414, 386, 405;

$x_{11}, \ldots, x_{20}$: 411, 389, 414, 426, 410, 441, 459, 449, 486, 510;

$x_{21}, \ldots, x_{30}$: 506, 549, 579, 581, 630, 666, 674, 729, 771, 785.

This series is in fact the sum of a quadratic trend and a period-three seasonal
component. Use the program ITSM to apply the filter in Problem 1.14 to this
time series and discuss the results.

(Once the data have been typed, they can be imported directly into ITSM
by highlighting the data to be imported, using the Windows command Select
and Copy and then, in ITSM, selecting the option File>Project>New>
Univariate, clicking on OK and selecting File>Import Clipboard.)
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2.1  Basic  Properties 

2.2  Linear  Processes 

2.3  Introduction  to  ARMA  Processes 

2.4  Properties  of  the  Sample  Mean  and  Autocorrelation  Function 

2.5  Forecasting  Stationary  Time  Series 

2.6  The  Wold  Decomposition 


A  key  role  in  time  series  analysis  is  played  by  processes  whose  properties,  or  some 
of  them,  do  not  vary  with  time.  If  we  wish  to  make  predictions,  then  clearly  we 
must  assume  that  something  does  not  vary  with  time.  In  extrapolating  deterministic 
functions  it  is  common  practice  to  assume  that  either  the  function  itself  or  one  of  its 
derivatives  is  constant.  The  assumption  of  a  constant  first  derivative  leads  to  linear 
extrapolation  as  a  means  of  prediction.  In  time  series  analysis  our  goal  is  to  predict 
a  series  that  typically  is  not  deterministic  but  contains  a  random  component.  If  this 
random  component  is  stationary,  in  the  sense  of  Definition  1.4.2,  then  we  can  develop 
powerful  techniques  to  forecast  its  future  values.  These  techniques  will  be  developed 
and  discussed  in  this  and  subsequent  chapters. 


2.1  Basic  Properties 


In Section 1.4 we introduced the concept of stationarity and defined the
autocovariance function (ACVF) of a stationary time series $\{X_t\}$ as

$$\gamma(h) = \operatorname{Cov}(X_{t+h}, X_t), \quad h = 0, \pm 1, \pm 2, \ldots.$$

The autocorrelation function (ACF) of $\{X_t\}$ was defined similarly as the function
$\rho(\cdot)$ whose value at lag $h$ is

$$\rho(h) = \frac{\gamma(h)}{\gamma(0)}.$$

The ACVF and ACF provide a useful measure of the degree of dependence among
the values of a time series at different times and for this reason play an important role
when we consider the prediction of future values of the series in terms of the past and
present values. They can be estimated from observations of $X_1, \ldots, X_n$ by computing
the sample ACVF and ACF as described in Section 1.4.1.

The role of the autocorrelation function in prediction is illustrated by the following
simple example. Suppose that $\{X_t\}$ is a stationary Gaussian time series (see
Definition A.3.2) and that we have observed $X_n$. We would like to find the function of
$X_n$ that gives us the best predictor of $X_{n+h}$, the value of the series after another $h$
time units have elapsed. To define the problem we must first say what we mean by
“best.” A natural and computationally convenient definition is to specify our required
predictor to be the function of $X_n$ with minimum mean squared error. In this illustration,
and indeed throughout the remainder of this book, we shall use this as our criterion
for “best.” Now by Proposition A.3.1 the conditional distribution of $X_{n+h}$ given that
$X_n = x_n$ is

$$N\bigl(\mu + \rho(h)(x_n - \mu),\ \sigma^2(1 - \rho(h)^2)\bigr),$$

where $\mu$ and $\sigma^2$ are the mean and variance of $\{X_t\}$. It was shown in Problem 1.1 that
the value of the constant $c$ that minimizes $E(X_{n+h} - c)^2$ is $c = E(X_{n+h})$ and that the
function $m$ of $X_n$ that minimizes $E(X_{n+h} - m(X_n))^2$ is the conditional mean

$$m(X_n) = E(X_{n+h} \mid X_n) = \mu + \rho(h)(X_n - \mu). \tag{2.1.1}$$

The corresponding mean squared error is

$$E(X_{n+h} - m(X_n))^2 = \sigma^2(1 - \rho(h)^2). \tag{2.1.2}$$

This calculation shows that at least for stationary Gaussian time series, prediction of
$X_{n+h}$ in terms of $X_n$ is more accurate as $|\rho(h)|$ becomes closer to 1, and in the limit as
$\rho(h) \to \pm 1$ the best predictor approaches $\mu \pm (X_n - \mu)$ and the corresponding mean
squared error approaches 0.

In the preceding calculation the assumption of joint normality of $X_{n+h}$ and $X_n$
played a crucial role. For time series with nonnormal joint distributions the
corresponding calculations are in general much more complicated. However, if instead of
looking for the best function of $X_n$ for predicting $X_{n+h}$, we look for the best linear
predictor, i.e., the best predictor of the form $\ell(X_n) = aX_n + b$, then our problem
becomes that of finding $a$ and $b$ to minimize $E(X_{n+h} - aX_n - b)^2$. An elementary
calculation (Problem 2.1) shows that the best predictor of this form is

$$\ell(X_n) = \mu + \rho(h)(X_n - \mu) \tag{2.1.3}$$

with corresponding mean squared error

$$E(X_{n+h} - \ell(X_n))^2 = \sigma^2(1 - \rho(h)^2). \tag{2.1.4}$$

Comparison of (2.1.1) and (2.1.3) shows that for Gaussian processes, $\ell(X_n)$ and
$m(X_n)$ are the same. In general, of course, $m(X_n)$ will give smaller mean squared
error than $\ell(X_n)$, since it is the best of a larger class of predictors (see Problem 1.8).
However, the fact that the best linear predictor depends only on the mean and ACF of
the series $\{X_t\}$ means that it can be calculated without more detailed knowledge of the
joint distributions. This is extremely important in practice because of the difficulty of
estimating all of the joint distributions and because of the difficulty of computing the
required conditional expectations even if the distributions were known.

As we shall see later in this chapter, similar conclusions apply when we consider
the more general problem of predicting $X_{n+h}$ as a function not only of $X_n$, but also of
$X_{n-1}, X_{n-2}, \ldots$. Before pursuing this question we need to examine in more detail the
properties of the autocovariance and autocorrelation functions of a stationary time
series.
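The best linear predictor (2.1.3) and its mean squared error (2.1.4) can be sketched in a few lines of Python. The function name and the numerical values below are hypothetical and serve only as an illustration; in practice the mean, variance, and lag-h autocorrelation would have to be estimated from data.

```python
def best_linear_predictor(x_n, mu, rho_h, sigma2):
    """Return the predictor (2.1.3) of X_{n+h} and its mean squared error (2.1.4)."""
    prediction = mu + rho_h * (x_n - mu)
    mse = sigma2 * (1.0 - rho_h ** 2)
    return prediction, mse


# Hypothetical values, chosen only for illustration.
prediction, mse = best_linear_predictor(x_n=12.0, mu=10.0, rho_h=0.5, sigma2=4.0)
print(prediction, mse)
```

With these inputs the predictor shrinks the observed deviation from the mean by the factor 0.5, and the mean squared error is three quarters of the variance of the series.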


Basic Properties of $\gamma(\cdot)$:

$$\gamma(0) \ge 0,$$

$$|\gamma(h)| \le \gamma(0) \quad \text{for all } h,$$

and $\gamma(\cdot)$ is even, i.e.,

$$\gamma(h) = \gamma(-h) \quad \text{for all } h.$$


Proof The first property is simply the statement that $\operatorname{Var}(X_t) \ge 0$, the second is an immediate
consequence of the fact that correlations are less than or equal to 1 in absolute value
(or the Cauchy-Schwarz inequality), and the third is established by observing that

$$\gamma(h) = \operatorname{Cov}(X_{t+h}, X_t) = \operatorname{Cov}(X_t, X_{t+h}) = \gamma(-h). \qquad \blacksquare$$


Autocovariance  functions  have  another  fundamental  property,  namely  that  of 
nonnegative  definiteness. 


Definition 2.1.1

A real-valued function $\kappa$ defined on the integers is nonnegative definite if

$$\sum_{i,j=1}^{n} a_i \kappa(i - j) a_j \ge 0 \tag{2.1.5}$$

for all positive integers $n$ and vectors $\mathbf{a} = (a_1, \ldots, a_n)'$ with real-valued
components $a_i$.


Theorem 2.1.1 A real-valued function defined on the integers is the autocovariance function of a
stationary time series if and only if it is even and nonnegative definite.

Proof To show that the autocovariance function $\gamma(\cdot)$ of any stationary time series $\{X_t\}$ is
nonnegative definite, let $\mathbf{a}$ be any $n \times 1$ vector with real components $a_1, \ldots, a_n$ and let
$\mathbf{X}_n = (X_n, \ldots, X_1)'$. Then by equation (A.2.5) and the nonnegativity of variances,

$$\operatorname{Var}(\mathbf{a}'\mathbf{X}_n) = \mathbf{a}'\Gamma_n\mathbf{a} = \sum_{i,j=1}^{n} a_i \gamma(i - j) a_j \ge 0,$$

where $\Gamma_n$ is the covariance matrix of the random vector $\mathbf{X}_n$. The last inequality,
however, is precisely the statement that $\gamma(\cdot)$ is nonnegative definite. The converse
result, that there exists a stationary time series with autocovariance function $\kappa$ if
$\kappa$ is even, real-valued, and nonnegative definite, is more difficult to establish (see
Brockwell and Davis (1991), Theorem 1.5.1 for a proof). A slightly stronger statement
can be made, namely, that under the specified conditions there exists a stationary
Gaussian time series $\{X_t\}$ with mean 0 and autocovariance function $\kappa(\cdot)$. ■


Remark 1. An autocorrelation function $\rho(\cdot)$ has all the properties of an
autocovariance function and satisfies the additional condition $\rho(0) = 1$. In particular, we can say
that $\rho(\cdot)$ is the autocorrelation function of a stationary process if and only if $\rho(\cdot)$ is an
ACVF with $\rho(0) = 1$. □


Remark 2. To verify that a given function is nonnegative definite it is often simpler
to find a stationary process that has the given function as its ACVF than to verify the
conditions (2.1.5) directly. For example, the function $\kappa(h) = \cos(\omega h)$ is nonnegative
definite, since (see Problem 2.2) it is the ACVF of the stationary process

$$X_t = A\cos(\omega t) + B\sin(\omega t),$$

where $A$ and $B$ are uncorrelated random variables, both with mean 0 and variance 1.
Another illustration is provided by the following example. □


Example 2.1.1

We shall show now that the function defined on the integers by

$$\kappa(h) = \begin{cases} 1, & \text{if } h = 0, \\ \rho, & \text{if } h = \pm 1, \\ 0, & \text{otherwise}, \end{cases}$$

is the ACVF of a stationary time series if and only if $|\rho| \le \tfrac{1}{2}$. Inspection of the ACVF
of the MA(1) process of Example 1.4.4 shows that $\kappa$ is the ACVF of such a process if
we can find real $\theta$ and nonnegative $\sigma^2$ such that

$$\sigma^2(1 + \theta^2) = 1$$

and

$$\sigma^2\theta = \rho.$$

If $|\rho| \le \tfrac{1}{2}$, these equations give solutions $\theta = (2\rho)^{-1}\bigl(1 \pm \sqrt{1 - 4\rho^2}\bigr)$ and
$\sigma^2 = (1 + \theta^2)^{-1}$. However, if $|\rho| > \tfrac{1}{2}$, there is no real solution for $\theta$ and hence no MA(1)
process with ACVF $\kappa$. To show that there is no stationary process with ACVF $\kappa$,
we need to show that $\kappa$ is not nonnegative definite. We shall do this directly from the
definition (2.1.5). First, if $\rho > \tfrac{1}{2}$, $K = [\kappa(i - j)]_{i,j=1}^{n}$, and $\mathbf{a}$ is the $n$-component vector
$\mathbf{a} = (1, -1, 1, -1, \ldots)'$, then

$$\mathbf{a}'K\mathbf{a} = n - 2(n - 1)\rho < 0 \quad \text{for } n > 2\rho/(2\rho - 1),$$

showing that $\kappa(\cdot)$ is not nonnegative definite and therefore, by Theorem 2.1.1, is not
an autocovariance function. If $\rho < -\tfrac{1}{2}$, the same argument with $\mathbf{a} = (1, 1, 1, 1, \ldots)'$
again shows that $\kappa(\cdot)$ is not nonnegative definite.

□
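The argument of Example 2.1.1 can be checked numerically. The following is a minimal sketch assuming NumPy is available; the helper names and the particular values of rho and n are hypothetical. It evaluates the quadratic form for the alternating vector and, as an additional check not used in the text, inspects the smallest eigenvalue of the matrix K.

```python
import numpy as np


def kappa_matrix(rho, n):
    """Build K = [kappa(i - j)] with kappa(0) = 1, kappa(+-1) = rho, 0 otherwise."""
    K = np.eye(n)
    idx = np.arange(n - 1)
    K[idx, idx + 1] = rho
    K[idx + 1, idx] = rho
    return K


def alternating_quadratic_form(rho, n):
    """Evaluate a'Ka for the alternating vector a = (1, -1, 1, -1, ...)'."""
    a = (-1.0) ** np.arange(n)
    return float(a @ kappa_matrix(rho, n) @ a)


rho, n = 0.6, 10   # hypothetical values with rho > 1/2 and n > 2*rho/(2*rho - 1) = 6
q_form = alternating_quadratic_form(rho, n)

# Smallest eigenvalue of K: negative means K is not nonnegative definite.
min_eig_bad = np.linalg.eigvalsh(kappa_matrix(0.6, n)).min()
min_eig_ok = np.linalg.eigvalsh(kappa_matrix(0.5, n)).min()
print(q_form, min_eig_bad, min_eig_ok)
```

The quadratic form agrees with n - 2(n - 1)rho, and the eigenvalue check separates the case rho = 0.6 (not an ACVF) from rho = 0.5 (an ACVF) for this particular n.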

If $\{X_t\}$ is a (weakly) stationary time series, then the vector $(X_1, \ldots, X_n)'$ and the
time-shifted vector $(X_{1+h}, \ldots, X_{n+h})'$ have the same mean vectors and covariance
matrices for every integer $h$ and positive integer $n$. A strictly stationary sequence is
one in which the joint distributions of these two vectors (and not just the means and
covariances) are the same. The precise definition is given below.

Definition 2.1.2

$\{X_t\}$ is a strictly stationary time series if

$$(X_1, \ldots, X_n)' \overset{d}{=} (X_{1+h}, \ldots, X_{n+h})'$$

for all integers $h$ and $n \ge 1$. (Here $\overset{d}{=}$ is used to indicate that the two random vectors
have the same joint distribution function.)


For  reference,  we  record  some  of  the  elementary  properties  of  strictly  stationary 
time  series. 


Properties of a Strictly Stationary Time Series $\{X_t\}$:

a. The random variables $X_t$ are identically distributed.

b. $(X_t, X_{t+h})' \overset{d}{=} (X_1, X_{1+h})'$ for all integers $t$ and $h$.

c. $\{X_t\}$ is weakly stationary if $E(X_t^2) < \infty$ for all $t$.

d. Weak stationarity does not imply strict stationarity.

e. An iid sequence is strictly stationary.


Proof Properties (a) and (b) follow at once from Definition 2.1.2. If $EX_t^2 < \infty$, then by
(a) and (b) $EX_t$ is independent of $t$ and $\operatorname{Cov}(X_t, X_{t+h}) = \operatorname{Cov}(X_1, X_{1+h})$, which is
also independent of $t$, proving (c). For (d) see Problem 1.8. If $\{X_t\}$ is an iid sequence
of random variables with common distribution function $F$, then the joint distribution
function of $(X_{1+h}, \ldots, X_{n+h})'$ evaluated at $(x_1, \ldots, x_n)'$ is $F(x_1) \cdots F(x_n)$, which is
independent of $h$. ■

One of the simplest ways to construct a time series $\{X_t\}$ that is strictly stationary
(and hence stationary if $EX_t^2 < \infty$) is to “filter” an iid sequence of random variables.
Let $\{Z_t\}$ be an iid sequence, which by (e) is strictly stationary, and define

$$X_t = g(Z_t, Z_{t-1}, \ldots, Z_{t-q}) \tag{2.1.6}$$

for some real-valued function $g(\cdot, \ldots, \cdot)$. Then $\{X_t\}$ is strictly stationary, since
$(Z_{t+h}, \ldots, Z_{t+h-q})' \overset{d}{=} (Z_t, \ldots, Z_{t-q})'$ for all integers $h$. It follows also from the
defining equation (2.1.6) that $\{X_t\}$ is $q$-dependent, i.e., that $X_s$ and $X_t$ are independent
whenever $|t - s| > q$. (An iid sequence is 0-dependent.) In the same way, adopting
a second-order viewpoint, we say that a stationary time series is $q$-correlated if
$\gamma(h) = 0$ whenever $|h| > q$. A white noise sequence is then 0-correlated, while
the MA(1) process of Example 1.4.4 is 1-correlated. The moving-average process of
order $q$ defined below is $q$-correlated, and perhaps surprisingly, the converse is also
true (Proposition 2.1.1).


The MA($q$) Process:

$\{X_t\}$ is a moving-average process of order $q$ if

$$X_t = Z_t + \theta_1 Z_{t-1} + \cdots + \theta_q Z_{t-q}, \tag{2.1.7}$$

where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$ and $\theta_1, \ldots, \theta_q$ are constants.

It is a simple matter to check that (2.1.7) defines a stationary time series that is strictly
stationary if $\{Z_t\}$ is iid noise. In the latter case, (2.1.7) is a special case of (2.1.6) with
$g$ a linear function.

The importance of MA($q$) processes derives from the fact that every $q$-correlated
process is an MA($q$) process. This is the content of the following proposition, whose
proof can be found in Brockwell and Davis (1991), Section 3.2. The extension of this
result to the case $q = \infty$ is essentially Wold’s decomposition (see Section 2.6).

Proposition 2.1.1 If $\{X_t\}$ is a stationary $q$-correlated time series with mean 0, then it can be represented
as the MA($q$) process in (2.1.7).
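That an MA(q) process is q-correlated can be illustrated numerically. The sketch below assumes NumPy and uses the standard expression for the MA(q) autocovariance, namely sigma^2 times the sum of theta_j theta_{j+|h|} over j = 0, ..., q - |h| with theta_0 = 1; this expression is a consequence of (2.1.7) and the white noise property but is not written out in the text above, so it should be read as an assumption of the example. The function name and coefficient values are hypothetical.

```python
import numpy as np


def ma_acvf(theta, sigma2, max_lag):
    """ACVF of X_t = Z_t + theta_1 Z_{t-1} + ... + theta_q Z_{t-q} at lags 0..max_lag."""
    coeffs = np.concatenate(([1.0], np.asarray(theta, dtype=float)))  # theta_0 = 1
    q = len(coeffs) - 1
    gamma = np.zeros(max_lag + 1)
    for h in range(min(q, max_lag) + 1):
        # Sum of theta_j * theta_{j+h} over j = 0, ..., q - h.
        gamma[h] = sigma2 * np.dot(coeffs[: q + 1 - h], coeffs[h:])
    return gamma


# Hypothetical MA(2) coefficients.
gamma = ma_acvf(theta=[0.5, -0.25], sigma2=1.0, max_lag=4)
print(gamma)
```

The autocovariances at lags 3 and 4 are exactly zero, which is the 2-correlated property for this MA(2) example.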


2.2  Linear  Processes 

The  class  of  linear  time  series  models,  which  includes  the  class  of  autoregressive 
moving-average  (ARMA)  models,  provides  a  general  framework  for  studying 
stationary processes. In fact, every second-order stationary process is either a linear
process or can be transformed to a linear process by subtracting a deterministic
component. This result is known as Wold’s decomposition and is discussed in Section 2.6.


Definition 2.2.1

The time series $\{X_t\}$ is a linear process if it has the representation

$$X_t = \sum_{j=-\infty}^{\infty} \psi_j Z_{t-j} \tag{2.2.1}$$

for all $t$, where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$ and $\{\psi_j\}$ is a sequence of constants with
$\sum_{j=-\infty}^{\infty} |\psi_j| < \infty$.


In terms of the backward shift operator $B$, (2.2.1) can be written more compactly as

$$X_t = \psi(B) Z_t, \tag{2.2.2}$$

where $\psi(B) = \sum_{j=-\infty}^{\infty} \psi_j B^j$. A linear process is called a moving average or MA($\infty$)
if $\psi_j = 0$ for all $j < 0$, i.e., if

$$X_t = \sum_{j=0}^{\infty} \psi_j Z_{t-j}.$$

Remark 1. The condition $\sum_{j=-\infty}^{\infty} |\psi_j| < \infty$ ensures that the infinite sum in (2.2.1)
converges (with probability one), since $E|Z_t| \le \sigma$ and

$$E|X_t| \le \sum_{j=-\infty}^{\infty} \bigl(|\psi_j|\, E|Z_{t-j}|\bigr) \le \left(\sum_{j=-\infty}^{\infty} |\psi_j|\right)\sigma < \infty.$$

It also ensures that $\sum_{j=-\infty}^{\infty} \psi_j^2 < \infty$ and hence (see Appendix C, Example C.1.1) that
the series in (2.2.1) converges in mean square, i.e., that $X_t$ is the mean square limit
of the partial sums $\sum_{j=-n}^{n} \psi_j Z_{t-j}$. The condition $\sum_{j=-\infty}^{\infty} |\psi_j| < \infty$ also ensures
convergence in both senses of the more general series (2.2.3) considered in
Proposition 2.2.1 below. In Section 11.4 we consider a more general class of linear
processes, the fractionally integrated ARMA processes, for which the coefficients are
not absolutely summable but only square summable. □

The operator $\psi(B)$ can be thought of as a linear filter, which when applied to
the white noise “input” series $\{Z_t\}$ produces the “output” $\{X_t\}$ (see Section 4.3). As
established in the following proposition, a linear filter, when applied to any stationary
input series, produces a stationary output series.

Proposition 2.2.1 Let $\{Y_t\}$ be a stationary time series with mean 0 and covariance function $\gamma_Y$. If
$\sum_{j=-\infty}^{\infty} |\psi_j| < \infty$, then the time series

$$X_t = \sum_{j=-\infty}^{\infty} \psi_j Y_{t-j} = \psi(B) Y_t \tag{2.2.3}$$

is stationary with mean 0 and autocovariance function

$$\gamma_X(h) = \sum_{j=-\infty}^{\infty} \sum_{k=-\infty}^{\infty} \psi_j \psi_k \gamma_Y(h + k - j). \tag{2.2.4}$$

In the special case where $\{X_t\}$ is the linear process (2.2.1),

$$\gamma_X(h) = \sum_{j=-\infty}^{\infty} \psi_j \psi_{j+h} \sigma^2. \tag{2.2.5}$$


Proof The argument used in Remark 1, with $\sigma$ replaced by $\sqrt{\gamma_Y(0)}$, shows that the series in
(2.2.3) is convergent. Since $EY_t = 0$, we have

$$E(X_t) = E\left(\sum_{j=-\infty}^{\infty} \psi_j Y_{t-j}\right) = \sum_{j=-\infty}^{\infty} \psi_j E(Y_{t-j}) = 0$$

and

$$\begin{aligned}
E(X_{t+h} X_t) &= E\left[\left(\sum_{j=-\infty}^{\infty} \psi_j Y_{t+h-j}\right)\left(\sum_{k=-\infty}^{\infty} \psi_k Y_{t-k}\right)\right] \\
&= \sum_{j=-\infty}^{\infty} \sum_{k=-\infty}^{\infty} \psi_j \psi_k E(Y_{t+h-j} Y_{t-k}) \\
&= \sum_{j=-\infty}^{\infty} \sum_{k=-\infty}^{\infty} \psi_j \psi_k \gamma_Y(h - j + k),
\end{aligned}$$

which shows that $\{X_t\}$ is stationary with covariance function (2.2.4). (The interchange
of summation and expectation operations in the above calculations can be justified by
the absolute summability of $\psi_j$.) Finally, if $\{Y_t\}$ is the white noise sequence $\{Z_t\}$ in
(2.2.1), then $\gamma_Y(h - j + k) = \sigma^2$ if $k = j - h$ and 0 otherwise, from which (2.2.5)
follows. ■


Remark 2. The absolute convergence of (2.2.3) implies (Problem 2.6) that filters of
the form $\alpha(B) = \sum_{j=-\infty}^{\infty} \alpha_j B^j$ and $\beta(B) = \sum_{j=-\infty}^{\infty} \beta_j B^j$ with absolutely summable
coefficients can be applied successively to a stationary series $\{Y_t\}$ to generate a new
stationary series

$$W_t = \sum_{j=-\infty}^{\infty} \psi_j Y_{t-j},$$

where

$$\psi_j = \sum_{k=-\infty}^{\infty} \alpha_k \beta_{j-k} = \sum_{k=-\infty}^{\infty} \beta_k \alpha_{j-k}. \tag{2.2.6}$$

These relations can be expressed in the equivalent form

$$W_t = \psi(B) Y_t,$$

where

$$\psi(B) = \alpha(B)\beta(B) = \beta(B)\alpha(B), \tag{2.2.7}$$

and the products are defined by (2.2.6) or equivalently by multiplying the series
$\sum_{j=-\infty}^{\infty} \alpha_j B^j$ and $\sum_{j=-\infty}^{\infty} \beta_j B^j$ term by term and collecting powers of $B$. It is clear
from (2.2.6) and (2.2.7) that the order of application of the filters $\alpha(B)$ and $\beta(B)$ is
immaterial. □

Example 2.2.1 An AR(1) Process

In Example 1.4.5, an AR(1) process was defined as a stationary solution $\{X_t\}$ of the
equations

$$X_t - \phi X_{t-1} = Z_t, \tag{2.2.8}$$

where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$, $|\phi| < 1$, and $Z_t$ is uncorrelated with $X_s$ for each $s < t$. To
show that such a solution exists and is the unique stationary solution of (2.2.8), we
consider the linear process defined by

$$X_t = \sum_{j=0}^{\infty} \phi^j Z_{t-j}. \tag{2.2.9}$$

(The coefficients $\phi^j$ for $j \ge 0$ are absolutely summable, since $|\phi| < 1$.) It is easy to
verify directly that the process (2.2.9) is a solution of (2.2.8), and by Proposition 2.2.1
it is also stationary with mean 0 and ACVF

$$\gamma_X(h) = \sum_{j=0}^{\infty} \phi^j \phi^{j+h} \sigma^2 = \frac{\sigma^2 \phi^h}{1 - \phi^2},$$

for $h \ge 0$.

To show that (2.2.9) is the only stationary solution of (2.2.8) let $\{Y_t\}$ be any
stationary solution. Then, iterating (2.2.8), we obtain

$$\begin{aligned}
Y_t &= \phi Y_{t-1} + Z_t \\
&= Z_t + \phi Z_{t-1} + \phi^2 Y_{t-2} \\
&= \cdots \\
&= Z_t + \phi Z_{t-1} + \cdots + \phi^k Z_{t-k} + \phi^{k+1} Y_{t-k-1}.
\end{aligned}$$

If $\{Y_t\}$ is stationary, then $EY_t^2$ is finite and independent of $t$, so that

$$E\Bigl(Y_t - \sum_{j=0}^{k} \phi^j Z_{t-j}\Bigr)^2 = \phi^{2k+2} E(Y_{t-k-1})^2 \to 0 \quad \text{as } k \to \infty.$$

This implies that $Y_t$ is equal to the mean square limit $\sum_{j=0}^{\infty} \phi^j Z_{t-j}$ and hence that the
process defined by (2.2.9) is the unique stationary solution of equation (2.2.8).

In the case $|\phi| > 1$, the series in (2.2.9) does not converge. However, we can rewrite
(2.2.8) in the form

$$X_t = -\phi^{-1} Z_{t+1} + \phi^{-1} X_{t+1}. \tag{2.2.10}$$

Iterating (2.2.10) gives

$$\begin{aligned}
X_t &= -\phi^{-1} Z_{t+1} - \phi^{-2} Z_{t+2} + \phi^{-2} X_{t+2} \\
&= \cdots \\
&= -\phi^{-1} Z_{t+1} - \cdots - \phi^{-k-1} Z_{t+k+1} + \phi^{-k-1} X_{t+k+1},
\end{aligned}$$

which shows, by the same arguments used above, that

$$X_t = -\sum_{j=1}^{\infty} \phi^{-j} Z_{t+j} \tag{2.2.11}$$

is the unique stationary solution of (2.2.8). This solution should not be confused with
the nonstationary solution $\{X_t\}$ of (2.2.8) obtained when $X_0$ is any specified random
variable that is uncorrelated with $\{Z_t\}$.

The solution (2.2.11) is frequently regarded as unnatural, since $X_t$ as defined by
(2.2.11) is correlated with future values of $Z_s$, contrasting with the solution (2.2.9),
which has the property that $X_t$ is uncorrelated with $Z_s$ for all $s > t$. It is customary
therefore in modeling stationary time series to restrict attention to AR(1) processes
with $|\phi| < 1$. Then $X_t$ has the representation (2.2.9) in terms of $\{Z_s, s \le t\}$, and we
say that $\{X_t\}$ is a causal or future-independent function of $\{Z_t\}$, or more concisely that
$\{X_t\}$ is a causal autoregressive process. It should be noted that every AR(1) process with
$|\phi| > 1$ can be reexpressed as an AR(1) process with $|\phi| < 1$ and a new white noise
sequence (Problem 3.8). From a second-order point of view, therefore, nothing is lost
by eliminating AR(1) processes with $|\phi| > 1$ from consideration.

If $\phi = \pm 1$, there is no stationary solution of (2.2.8) (see Problem 2.8).

□
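The closed form of the AR(1) autocovariance in Example 2.2.1 can be compared with a direct, truncated evaluation of the sum in (2.2.5) with coefficients phi^j. This is only a sketch: NumPy is assumed, the function names are hypothetical, and the parameter values and truncation point are arbitrary choices.

```python
import numpy as np


def ar1_acvf_truncated(phi, sigma2, h, n_terms=2000):
    """Approximate gamma_X(h) by truncating the sum (2.2.5) with psi_j = phi**j, j >= 0."""
    j = np.arange(n_terms)
    return sigma2 * np.sum(phi ** j * phi ** (j + h))


def ar1_acvf_closed_form(phi, sigma2, h):
    """Closed form sigma^2 * phi^h / (1 - phi^2), valid for |phi| < 1 and h >= 0."""
    return sigma2 * phi ** h / (1.0 - phi ** 2)


phi, sigma2 = 0.7, 2.0   # hypothetical causal parameter values
lags = range(6)
truncated = [ar1_acvf_truncated(phi, sigma2, h) for h in lags]
closed = [ar1_acvf_closed_form(phi, sigma2, h) for h in lags]
print(np.allclose(truncated, closed))
```

Because the coefficients decay geometrically when |phi| < 1, the truncated sum and the closed form agree to within floating-point accuracy.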

Remark 3. It is worth remarking that when $|\phi| < 1$ the unique stationary solution
(2.2.9) can be found immediately with the aid of (2.2.7). To do this let $\phi(B) = 1 - \phi B$
and $\pi(B) = \sum_{j=0}^{\infty} \phi^j B^j$. Then

$$\psi(B) := \phi(B)\pi(B) = 1.$$

Applying the operator $\pi(B)$ to both sides of (2.2.8), we obtain

$$X_t = \pi(B) Z_t = \sum_{j=0}^{\infty} \phi^j Z_{t-j}$$

as claimed. □


2.3  Introduction  to  ARMA  Processes 

In this section we introduce, through an example, some of the key properties of an
important class of linear processes known as ARMA (autoregressive moving average)
processes. These are defined by linear difference equations with constant coefficients.
As our example we shall consider the ARMA(1,1) process. Higher-order ARMA
processes will be discussed in Chapter 3.


Definition 2.3.1

The time series $\{X_t\}$ is an ARMA(1,1) process if it is stationary and satisfies (for
every $t$)

$$X_t - \phi X_{t-1} = Z_t + \theta Z_{t-1}, \tag{2.3.1}$$

where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$ and $\phi + \theta \ne 0$.


Using the backward shift operator $B$, (2.3.1) can be written more concisely as

$$\phi(B) X_t = \theta(B) Z_t, \tag{2.3.2}$$

where $\phi(B)$ and $\theta(B)$ are the linear filters

$$\phi(B) = 1 - \phi B \quad \text{and} \quad \theta(B) = 1 + \theta B,$$

respectively.

We first investigate the range of values of $\phi$ and $\theta$ for which a stationary solution
of (2.3.1) exists. If $|\phi| < 1$, let $\chi(z)$ denote the power series expansion of $1/\phi(z)$,
i.e., $\sum_{j=0}^{\infty} \phi^j z^j$, which has absolutely summable coefficients. Then from (2.2.7) we
conclude that $\chi(B)\phi(B) = 1$. Applying $\chi(B)$ to each side of (2.3.2) therefore gives

$$X_t = \chi(B)\theta(B) Z_t = \psi(B) Z_t,$$

where

$$\psi(B) = \sum_{j=0}^{\infty} \psi_j B^j = \bigl(1 + \phi B + \phi^2 B^2 + \cdots\bigr)(1 + \theta B).$$

By multiplying out the right-hand side or using (2.2.6), we find that

$$\psi_0 = 1 \quad \text{and} \quad \psi_j = (\phi + \theta)\phi^{j-1} \quad \text{for } j \ge 1.$$

As in Example 2.2.1, we conclude that the MA($\infty$) process

$$X_t = Z_t + (\phi + \theta) \sum_{j=1}^{\infty} \phi^{j-1} Z_{t-j} \tag{2.3.3}$$

is the unique stationary solution of (2.3.1).

Now suppose that $|\phi| > 1$. We first represent $1/\phi(z)$ as a series of powers of $z$ with
absolutely summable coefficients by expanding in powers of $z^{-1}$, giving (Problem 2.7)

$$\frac{1}{\phi(z)} = -\sum_{j=1}^{\infty} \phi^{-j} z^{-j}.$$

Then we can apply the same argument as in the case where $|\phi| < 1$ to obtain the
unique stationary solution of (2.3.1). We let $\chi(B) = -\sum_{j=1}^{\infty} \phi^{-j} B^{-j}$ and apply $\chi(B)$
to each side of (2.3.2) to obtain

$$X_t = \chi(B)\theta(B) Z_t = -\theta\phi^{-1} Z_t - (\theta + \phi) \sum_{j=1}^{\infty} \phi^{-j-1} Z_{t+j}. \tag{2.3.4}$$

If $\phi = \pm 1$, there is no stationary solution of (2.3.1). Consequently, there is no
such thing as an ARMA(1,1) process with $\phi = \pm 1$ according to our definition.

We can now summarize our findings about the existence and nature of the
stationary solutions of the ARMA(1,1) recursions (2.3.2) as follows:

• A stationary solution of the ARMA(1,1) equations exists if and only if $\phi \ne \pm 1$.

• If $|\phi| < 1$, then the unique stationary solution is given by (2.3.3). In this case we
say that $\{X_t\}$ is causal or a causal function of $\{Z_t\}$, since $X_t$ can be expressed in
terms of the current and past values $Z_s$, $s \le t$.

• If $|\phi| > 1$, then the unique stationary solution is given by (2.3.4). The solution is
noncausal, since $X_t$ is then a function of $Z_s$, $s \ge t$.
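In the causal case, the coefficients of the MA(infinity) representation (2.3.3) can be obtained numerically by multiplying the two coefficient sequences, which is the convolution in (2.2.6). The sketch below assumes NumPy; the function name, the parameter values, and the number of coefficients retained are hypothetical choices.

```python
import numpy as np


def arma11_psi(phi, theta, n_weights):
    """First n_weights coefficients psi_0, psi_1, ... of chi(B) theta(B), assuming |phi| < 1."""
    chi = phi ** np.arange(n_weights)       # truncated expansion of 1 / (1 - phi z)
    theta_poly = np.array([1.0, theta])     # coefficients of 1 + theta z
    # A product of power series corresponds to a convolution of coefficients, as in (2.2.6).
    return np.convolve(chi, theta_poly)[:n_weights]


phi, theta = 0.5, 0.4   # hypothetical causal parameter values
psi = arma11_psi(phi, theta, n_weights=8)
j = np.arange(1, 8)
psi_closed = np.concatenate(([1.0], (phi + theta) * phi ** (j - 1)))
print(np.allclose(psi, psi_closed))
```

The first n_weights coefficients of the truncated product are exact, since each depends only on lower-order terms of the two series; they match psi_0 = 1 and psi_j = (phi + theta) phi^(j-1).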

Just as causality means that $X_t$ is expressible in terms of $Z_s$, $s \le t$, the dual concept
of invertibility means that $Z_t$ is expressible in terms of $X_s$, $s \le t$. We show now that
the ARMA(1,1) process defined by (2.3.1) is invertible if $|\theta| < 1$. To demonstrate
this, let $\xi(z)$ denote the power series expansion of $1/\theta(z)$, i.e., $\sum_{j=0}^{\infty} (-\theta)^j z^j$, which has
absolutely summable coefficients. From (2.2.6) it therefore follows that $\xi(B)\theta(B) = 1$,
and applying $\xi(B)$ to each side of (2.3.2) gives

$$Z_t = \xi(B)\phi(B) X_t = \pi(B) X_t,$$

where

$$\pi(B) = \sum_{j=0}^{\infty} \pi_j B^j = \bigl(1 - \theta B + (-\theta)^2 B^2 + \cdots\bigr)(1 - \phi B).$$

By multiplying out the right-hand side or using (2.2.6), we find that

$$Z_t = X_t - (\phi + \theta) \sum_{j=1}^{\infty} (-\theta)^{j-1} X_{t-j}. \tag{2.3.5}$$

Thus the ARMA(1,1) process is invertible, since $Z_t$ can be expressed in terms of the
present and past values of the process $X_s$, $s \le t$. An argument like the one used to
show noncausality when $|\phi| > 1$ shows that the ARMA(1,1) process is noninvertible
when $|\theta| > 1$, since then

$$Z_t = -\phi\theta^{-1} X_t + (\theta + \phi) \sum_{j=1}^{\infty} (-\theta)^{-j-1} X_{t+j}. \tag{2.3.6}$$

We summarize these results as follows:

• If $|\theta| < 1$, then the ARMA(1,1) process is invertible, and $Z_t$ is expressed in terms
of $X_s$, $s \le t$, by (2.3.5).

• If $|\theta| > 1$, then the ARMA(1,1) process is noninvertible, and $Z_t$ is expressed in
terms of $X_s$, $s \ge t$, by (2.3.6).
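The invertible representation (2.3.5) can be checked in the same way as the causal one, by multiplying the truncated expansion of 1/theta(z) by the polynomial 1 - phi z. This is a sketch under the assumption |theta| < 1, using NumPy; the function name and the parameter values are hypothetical.

```python
import numpy as np


def arma11_pi(phi, theta, n_weights):
    """First n_weights coefficients pi_0, pi_1, ... of xi(B) phi(B), assuming |theta| < 1."""
    xi = (-theta) ** np.arange(n_weights)   # truncated expansion of 1 / (1 + theta z)
    phi_poly = np.array([1.0, -phi])        # coefficients of 1 - phi z
    return np.convolve(xi, phi_poly)[:n_weights]


phi, theta = 0.5, 0.4   # hypothetical causal and invertible parameter values
n = 8
pi_weights = arma11_pi(phi, theta, n)
j = np.arange(1, n)
# Coefficients read off from (2.3.5): pi_0 = 1, pi_j = -(phi + theta) * (-theta)**(j - 1).
pi_closed = np.concatenate(([1.0], -(phi + theta) * (-theta) ** (j - 1)))
print(np.allclose(pi_weights, pi_closed))
```

The weights alternate in sign and decay geometrically at rate |theta|, which is why the representation requires |theta| < 1 for absolute summability.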

Remark 1. In the cases $\theta = \pm 1$, the ARMA(1,1) process is invertible in the more
general sense that $Z_t$ is a mean square limit of finite linear combinations of $X_s$, $s \le t$,
although it cannot be expressed explicitly as an infinite linear combination of $X_s$, $s \le t$
(see Section 4.4 of Brockwell and Davis (1991)). In this book the term invertible
will always be used in the more restricted sense that $Z_t = \sum_{j=0}^{\infty} \pi_j X_{t-j}$, where
$\sum_{j=0}^{\infty} |\pi_j| < \infty$. □

Remark 2. If the ARMA(1,1) process $\{X_t\}$ is noncausal or noninvertible with $|\theta| > 1$,
then it is possible to find a new white noise sequence $\{W_t\}$ such that $\{X_t\}$ is a causal
and invertible ARMA(1,1) process relative to $\{W_t\}$ (Problem 4.10). Therefore,
from a second-order point of view, nothing is lost by restricting attention to causal
and invertible ARMA(1,1) models. This last sentence is also valid for higher-order
ARMA models. □


2.4  Properties  of  the  Sample  Mean  and  Autocorrelation  Function 

A stationary process $\{X_t\}$ is characterized, at least from a second-order point of view,
by its mean $\mu$ and its autocovariance function $\gamma(\cdot)$. The estimation of $\mu$, $\gamma(\cdot)$, and the
autocorrelation function $\rho(\cdot) = \gamma(\cdot)/\gamma(0)$ from observations $X_1, \ldots, X_n$ therefore
plays a crucial role in problems of inference and in particular in the problem of
constructing an appropriate model for the data. In this section we examine some of
the properties of the sample estimates $\bar{x}$ and $\hat{\rho}(\cdot)$ of $\mu$ and $\rho(\cdot)$, respectively.


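2.4.1 Estimation of $\mu$

The moment estimator of the mean $\mu$ of a stationary process is the sample mean

$$\bar{X}_n = n^{-1}(X_1 + X_2 + \cdots + X_n). \tag{2.4.1}$$

It is an unbiased estimator of $\mu$, since

$$E(\bar{X}_n) = n^{-1}(EX_1 + \cdots + EX_n) = \mu.$$

The mean squared error of $\bar{X}_n$ is

$$\begin{aligned}
E(\bar{X}_n - \mu)^2 &= \operatorname{Var}(\bar{X}_n) \\
&= n^{-2} \sum_{i=1}^{n} \sum_{j=1}^{n} \operatorname{Cov}(X_i, X_j) \\
&= n^{-2} \sum_{i-j=-n}^{n} (n - |i - j|)\,\gamma(i - j) \\
&= n^{-1} \sum_{h=-n}^{n} \left(1 - \frac{|h|}{n}\right)\gamma(h).
\end{aligned} \tag{2.4.2}$$

Now if $\gamma(h) \to 0$ as $h \to \infty$, the right-hand side of (2.4.2) converges to zero,
so that $\bar{X}_n$ converges in mean square to $\mu$. If $\sum_{h=-\infty}^{\infty} |\gamma(h)| < \infty$, then (2.4.2)
gives $\lim_{n\to\infty} n\operatorname{Var}(\bar{X}_n) = \sum_{|h|<\infty} \gamma(h)$. We record these results in the following
proposition.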


Proposition 2.4.1 If $\{X_t\}$ is a stationary time series with mean $\mu$ and autocovariance function $\gamma(\cdot)$,
then as $n \to \infty$,

$$\operatorname{Var}(\bar{X}_n) = E(\bar{X}_n - \mu)^2 \to 0 \quad \text{if } \gamma(n) \to 0,$$

$$nE(\bar{X}_n - \mu)^2 \to \sum_{|h|<\infty} \gamma(h) \quad \text{if } \sum_{h=-\infty}^{\infty} |\gamma(h)| < \infty.$$

To make inferences about $\mu$ using the sample mean $\bar{X}_n$, it is necessary to know the
distribution or an approximation to the distribution of $\bar{X}_n$. If the time series is Gaussian
(see Definition A.3.2), then by Remark 2 of Section A.3 and (2.4.2),

$$n^{1/2}(\bar{X}_n - \mu) \sim N\left(0,\ \sum_{|h|<n} \left(1 - \frac{|h|}{n}\right)\gamma(h)\right).$$

It is easy to construct exact confidence bounds for $\mu$ using this result if $\gamma(\cdot)$ is
known, and approximate confidence bounds if it is necessary to estimate $\gamma(\cdot)$ from
the observations.

For many time series, in particular for linear and ARMA models, $\bar{X}_n$ is
approximately normal with mean $\mu$ and variance $n^{-1}\sum_{|h|<\infty} \gamma(h)$ for large $n$ (see Brockwell
and Davis (1991), p. 219). An approximate 95 % confidence interval for $\mu$ is then

$$\left(\bar{X}_n - 1.96\,v^{1/2}/\sqrt{n},\ \bar{X}_n + 1.96\,v^{1/2}/\sqrt{n}\right), \tag{2.4.3}$$

where $v = \sum_{|h|<\infty} \gamma(h)$. Of course, $v$ is not generally known, so it must be estimated
from the data. The estimator computed in the program ITSM is
$\hat{v} = \sum_{|h|<\sqrt{n}} \bigl(1 - |h|/\sqrt{n}\bigr)\hat{\gamma}(h)$. For ARMA processes this is a good approximation to $v$ for large $n$.
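The interval (2.4.3), with v replaced by the tapered estimate described above, can be sketched as follows. This is not ITSM's code, only an illustration of the formula as stated in the text; NumPy is assumed, the function names are hypothetical, the sample autocovariance uses the divisor n of Section 1.4.1, and the data are simulated purely for demonstration.

```python
import numpy as np


def sample_acvf(x, h):
    """Sample autocovariance at lag h with divisor n, as in Section 1.4.1."""
    x = np.asarray(x, dtype=float)
    n, h = len(x), abs(h)
    d = x - x.mean()
    return float(np.dot(d[h:], d[: n - h]) / n)


def mean_confidence_interval(x, z=1.96):
    """Approximate 95% bounds (2.4.3) for the mean, with v estimated from the data."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    root_n = np.sqrt(n)
    max_lag = int(np.ceil(root_n)) - 1          # largest integer h with h < sqrt(n)
    v_hat = sample_acvf(x, 0) + 2.0 * sum(
        (1.0 - h / root_n) * sample_acvf(x, h) for h in range(1, max_lag + 1)
    )
    # Guard against a tiny negative value caused by rounding error.
    half_width = z * np.sqrt(max(v_hat, 0.0) / n)
    return x.mean() - half_width, x.mean() + half_width, v_hat


rng = np.random.default_rng(0)
x = 5.0 + rng.standard_normal(200)   # hypothetical iid data with mean 5
lower, upper, v_hat = mean_confidence_interval(x)
print(lower, upper, v_hat)
```

For positively correlated data the estimate of v exceeds the sample variance, so the interval is wider than the one obtained under an iid assumption.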

Example 2.4.1 An AR(1) Model

Let $\{X_t\}$ be an AR(1) process with mean $\mu$, defined by the equations

$$X_t - \mu = \phi(X_{t-1} - \mu) + Z_t,$$

where $|\phi| < 1$ and $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$. From Example 2.2.1 we have $\gamma(h) = \phi^{|h|}\sigma^2/(1 - \phi^2)$ and hence $v = \left(1 + 2\sum_{h=1}^{\infty}\phi^h\right)\sigma^2/(1 - \phi^2) = \sigma^2/(1 - \phi)^2$. Approximate 95% confidence bounds for $\mu$ are therefore given by $\bar x_n \pm 1.96\,\sigma n^{-1/2}/(1 - \phi)$. Since $\phi$ and $\sigma$ are unknown in practice, they must be replaced in these bounds by estimated values.

□ 
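The interval (2.4.3) and the truncated estimator of $v$ quoted above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the function names (`sample_acvf`, `v_hat`, `mean_confidence_interval`, `ar1_v`) are hypothetical and are not taken from ITSM, and the code follows the formula as stated in the text rather than the package's actual implementation.

```python
import math

def sample_acvf(x, h):
    """Sample autocovariance at lag h with divisor n (hypothetical helper)."""
    n = len(x)
    h = abs(h)
    xbar = sum(x) / n
    return sum((x[t + h] - xbar) * (x[t] - xbar) for t in range(n - h)) / n

def v_hat(x):
    """Estimate v by sum over |h| < sqrt(n) of (1 - |h|/sqrt(n)) * gamma_hat(h)."""
    r = math.sqrt(len(x))
    total = sample_acvf(x, 0)
    h = 1
    while h < r:
        # lags h and -h contribute equally, hence the factor 2
        total += 2.0 * (1.0 - h / r) * sample_acvf(x, h)
        h += 1
    return total

def mean_confidence_interval(x, z=1.96):
    """Approximate confidence bounds for the mean, as in (2.4.3) with v replaced by v_hat."""
    n = len(x)
    xbar = sum(x) / n
    half = z * math.sqrt(max(v_hat(x), 0.0) / n)
    return xbar - half, xbar + half

def ar1_v(phi, sigma2):
    """Closed form v = sigma^2 / (1 - phi)^2 for the AR(1) model of the example."""
    return sigma2 / (1.0 - phi) ** 2
```

For an AR(1) series with known $\phi$ and $\sigma^2$, `ar1_v` gives the exact $v$ against which `v_hat` can be compared on simulated data.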


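2.4.2 Estimation of $\gamma(\cdot)$ and $\rho(\cdot)$


Recall from Section 1.4.1 that the sample autocovariance and autocorrelation functions are defined by

$$\hat\gamma(h) = n^{-1}\sum_{t=1}^{n-|h|}\left(X_{t+|h|} - \bar X_n\right)\left(X_t - \bar X_n\right) \tag{2.4.4}$$

and

$$\hat\rho(h) = \frac{\hat\gamma(h)}{\hat\gamma(0)}. \tag{2.4.5}$$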

Both the estimators $\hat\gamma(h)$ and $\hat\rho(h)$ are biased even if the factor $n^{-1}$ in (2.4.4) is replaced by $(n - h)^{-1}$. Nevertheless, under general assumptions they are nearly unbiased for large sample sizes. The sample ACVF has the desirable property that for each $k \ge 1$ the $k$-dimensional sample covariance matrix

$$\hat\Gamma_k = \begin{bmatrix} \hat\gamma(0) & \hat\gamma(1) & \cdots & \hat\gamma(k-1) \\ \hat\gamma(1) & \hat\gamma(0) & \cdots & \hat\gamma(k-2) \\ \vdots & \vdots & & \vdots \\ \hat\gamma(k-1) & \hat\gamma(k-2) & \cdots & \hat\gamma(0) \end{bmatrix} \tag{2.4.6}$$

is nonnegative definite. To see this, first note that if $\hat\Gamma_m$ is nonnegative definite, then $\hat\Gamma_k$ is nonnegative definite for all $k < m$. So assume $k \ge n$ and write

$$\hat\Gamma_k = n^{-1} T T',$$

where $T$ is the $k \times 2k$ matrix

$$T = \begin{bmatrix} 0 & \cdots & 0 & 0 & Y_1 & Y_2 & \cdots & Y_k \\ 0 & \cdots & 0 & Y_1 & Y_2 & \cdots & Y_k & 0 \\ \vdots & & & & & & & \vdots \\ 0 & Y_1 & Y_2 & \cdots & Y_k & 0 & \cdots & 0 \end{bmatrix},$$

$Y_i = X_i - \bar X_n$, $i = 1, \ldots, n$, and $Y_i = 0$ for $i = n + 1, \ldots, k$. Then for any real $k \times 1$ vector $\mathbf{a}$ we have

$$\mathbf{a}'\hat\Gamma_k\mathbf{a} = n^{-1}(\mathbf{a}'T)(T'\mathbf{a}) \ge 0, \tag{2.4.7}$$

and consequently the sample autocovariance matrix $\hat\Gamma_k$ and sample autocorrelation matrix

$$\hat R_k = \hat\Gamma_k/\hat\gamma(0) \tag{2.4.8}$$

are nonnegative definite. Sometimes the factor $n^{-1}$ is replaced by $(n - h)^{-1}$ in the definition of $\hat\gamma(h)$, but the resulting covariance and correlation matrices $\hat\Gamma_n$ and $\hat R_n$ may not then be nonnegative definite. We shall therefore use the definitions (2.4.4) and (2.4.5) of $\hat\gamma(h)$ and $\hat\rho(h)$.

Remark 1. The matrices $\hat\Gamma_k$ and $\hat R_k$ are in fact nonsingular if there is at least one nonzero $Y_i$, or equivalently if $\hat\gamma(0) > 0$. To establish this result, suppose that $\hat\gamma(0) > 0$ and $\hat\Gamma_k$ is singular. Then there is equality in (2.4.7) for some nonzero vector $\mathbf{a}$, implying that $\mathbf{a}'T = \mathbf{0}$ and hence that the rank of $T$ is less than $k$. Let $Y_i$ be the first nonzero value of $Y_1, Y_2, \ldots, Y_k$, and consider the $k \times k$ submatrix of $T$ consisting of columns $(i + 1)$ through $(i + k)$. Since this matrix is lower right triangular with each diagonal element equal to $Y_i$, its determinant has absolute value $|Y_i|^k \ne 0$. Consequently, the submatrix is nonsingular, and $T$ must have rank $k$, a contradiction. □
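
The effect of the divisor on nonnegative definiteness can be seen in a very small numerical example. The sketch below is ours, not the book's: the helper names are hypothetical, and the three-point data set is chosen only because the arithmetic is easy to follow. It evaluates the quadratic form $\mathbf{a}'\hat\Gamma_3\mathbf{a}$ for both divisors.

```python
def sample_acvf(x, h, divisor_n_minus_h=False):
    """Sample autocovariance at lag h; divisor n as in (2.4.4), or n - |h| if requested."""
    n = len(x)
    h = abs(h)
    if h >= n:
        return 0.0
    xbar = sum(x) / n
    s = sum((x[t + h] - xbar) * (x[t] - xbar) for t in range(n - h))
    return s / (n - h) if divisor_n_minus_h else s / n

def acvf_matrix(x, k, divisor_n_minus_h=False):
    """k x k matrix with (i, j) entry gamma_hat(i - j), as in (2.4.6)."""
    return [[sample_acvf(x, i - j, divisor_n_minus_h) for j in range(k)]
            for i in range(k)]

def quadratic_form(mat, a):
    """Compute a' M a for a square matrix M given as a list of rows."""
    k = len(a)
    return sum(a[i] * mat[i][j] * a[j] for i in range(k) for j in range(k))

x = [1.0, 0.0, -1.0]   # illustrative data with sample mean zero
a = [1.0, 0.0, 1.0]    # illustrative test vector
q_divisor_n = quadratic_form(acvf_matrix(x, 3), a)
q_divisor_n_minus_h = quadratic_form(acvf_matrix(x, 3, True), a)
```

With divisor $n$ the quadratic form is $2/3$, in agreement with (2.4.7); with divisor $n - h$ it is $-2/3$, so that matrix is not nonnegative definite.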


Without further information beyond the observed data $X_1, \ldots, X_n$, it is impossible to give reasonable estimates of $\gamma(h)$ and $\rho(h)$ for $h \ge n$. Even for $h$ slightly smaller than $n$, the estimates $\hat\gamma(h)$ and $\hat\rho(h)$ are unreliable, since there are so few pairs $(X_{t+h}, X_t)$ available (only one if $h = n - 1$). A useful guide is provided by Box and Jenkins (1976), p. 33, who suggest that $n$ should be at least about 50 and $h \le n/4$.

The sample ACF plays an important role in the selection of suitable models for the data. We have already seen in Example 1.4.6 and Section 1.6 how the sample ACF can be used to test for iid noise. For systematic inference concerning $\rho(h)$, we need the sampling distribution of the estimator $\hat\rho(h)$. Although the distribution of $\hat\rho(h)$ is intractable for samples from even the simplest time series models, it can usually be well approximated by a normal distribution for large sample sizes. For linear models and in particular for ARMA models (see Theorem 7.2.2 of Brockwell and Davis (1991) for exact conditions) $\hat{\boldsymbol{\rho}}_k = (\hat\rho(1), \ldots, \hat\rho(k))'$ is approximately distributed for large $n$ as $\mathrm{N}(\boldsymbol{\rho}_k, n^{-1}W)$, i.e.,

$$\hat{\boldsymbol{\rho}} \approx \mathrm{N}\left(\boldsymbol{\rho}, n^{-1}W\right), \tag{2.4.9}$$

where $\boldsymbol{\rho} = (\rho(1), \ldots, \rho(k))'$, and $W$ is the covariance matrix whose $(i, j)$ element is given by Bartlett's formula

$$w_{ij} = \sum_{k=-\infty}^{\infty}\left\{\rho(k+i)\rho(k+j) + \rho(k-i)\rho(k+j) + 2\rho(i)\rho(j)\rho^2(k) - 2\rho(i)\rho(k)\rho(k+j) - 2\rho(j)\rho(k)\rho(k+i)\right\}.$$

Simple algebra shows that

$$w_{ij} = \sum_{k=1}^{\infty}\left\{\rho(k+i) + \rho(k-i) - 2\rho(i)\rho(k)\right\} \times \left\{\rho(k+j) + \rho(k-j) - 2\rho(j)\rho(k)\right\}, \tag{2.4.10}$$

which is a more convenient form of $w_{ij}$ for computational purposes.

Example 2.4.2 iid Noise

If $\{X_t\} \sim \mathrm{IID}(0, \sigma^2)$, then $\rho(h) = 0$ for $|h| > 0$, so from (2.4.10) we obtain

$$w_{ij} = \begin{cases} 1 & \text{if } i = j, \\ 0 & \text{otherwise.} \end{cases}$$

For large $n$, therefore, $\hat\rho(1), \ldots, \hat\rho(h)$ are approximately independent and identically distributed normal random variables with mean 0 and variance $n^{-1}$. This result is the basis for the test that data are generated from iid noise using the sample ACF described in Section 1.6. (See also Example 1.4.6.)

□


Example 2.4.3 An MA(1) Process

If $\{X_t\}$ is the MA(1) process of Example 1.4.4, i.e., if


$$X_t = Z_t + \theta Z_{t-1}, \quad t = 0, \pm 1, \ldots,$$

where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$, then from (2.4.10)

$$w_{ii} = \begin{cases} 1 - 3\rho^2(1) + 4\rho^4(1), & \text{if } i = 1, \\ 1 + 2\rho^2(1), & \text{if } i > 1, \end{cases}$$

is the approximate variance of $n^{1/2}(\hat\rho(i) - \rho(i))$ for large $n$. In Figure 2-1 we have plotted the sample autocorrelation function $\hat\rho(k)$, $k = 0, \ldots, 40$, for 200 observations from the MA(1) model

$$X_t = Z_t - 0.8 Z_{t-1}, \tag{2.4.11}$$

where $\{Z_t\}$ is a sequence of iid N(0, 1) random variables. Here $\rho(1) = -0.8/1.64 = -0.4878$ and $\rho(h) = 0$ for $h > 1$. The lag-one sample ACF is found to be $\hat\rho(1) = -0.4333 = -6.128\,n^{-1/2}$, which would cause us (in the absence of our prior knowledge of $\{X_t\}$) to reject the hypothesis that the data are a sample from an iid noise sequence. The fact that $|\hat\rho(h)| \le 1.96\,n^{-1/2}$ for $h = 2, \ldots, 40$ strongly suggests that the data are from a model in which observations are uncorrelated past lag 1. Figure 2-1 shows the bounds $\pm 1.96\,n^{-1/2}(1 + 2\rho^2(1))^{1/2}$, indicating the compatibility of the data with the model (2.4.11). Since, however, $\rho(1)$ is not normally known in advance, the autocorrelations $\hat\rho(2), \ldots, \hat\rho(40)$ would in practice have been compared with the more stringent bounds $\pm 1.96\,n^{-1/2}$ or with the bounds $\pm 1.96\,n^{-1/2}(1 + 2\hat\rho^2(1))^{1/2}$ in order to check the hypothesis that the data are generated by a moving-average process of order 1. Finally, it is worth noting that the lag-one correlation $-0.4878$ is well inside the 95% confidence bounds for $\rho(1)$ given by $\hat\rho(1) \pm 1.96\,n^{-1/2}(1 - 3\hat\rho^2(1) + 4\hat\rho^4(1))^{1/2} = -0.4333 \pm 0.1053$. This further supports the compatibility of the data with the model $X_t = Z_t - 0.8 Z_{t-1}$.

[Figure 2-1. The sample autocorrelation function of n = 200 observations of the MA(1) process in Example 2.4.3, showing the bounds $\pm 1.96\,n^{-1/2}(1 + 2\rho^2(1))^{1/2}$. Horizontal axis: Lag, 0 to 40.]

□


Example 2.4.4 An AR(1) Process

For the AR(1) process of Example 2.2.1,

$$X_t = \phi X_{t-1} + Z_t,$$

where $\{Z_t\}$ is iid noise and $|\phi| < 1$, we have, from (2.4.10) with $\rho(h) = \phi^{|h|}$,

$$w_{ii} = \sum_{k=1}^{i}\phi^{2i}\left(\phi^{-k} - \phi^{k}\right)^2 + \sum_{k=i+1}^{\infty}\phi^{2k}\left(\phi^{-i} - \phi^{i}\right)^2 = \left(1 - \phi^{2i}\right)\left(1 + \phi^2\right)\left(1 - \phi^2\right)^{-1} - 2i\phi^{2i}, \tag{2.4.12}$$

$i = 1, 2, \ldots$. In Figure 2-2 we have plotted the sample ACF of the Lake Huron residuals $y_1, \ldots, y_{98}$ from Figure 1-10 together with 95% confidence bounds for $\rho(i)$, $i = 1, \ldots, 40$, assuming that data are generated from the AR(1) model

$$Y_t = 0.791\,Y_{t-1} + Z_t \tag{2.4.13}$$

[see equation (1.4.3)]. The confidence bounds are computed from $\hat\rho(i) \pm 1.96\,n^{-1/2}w_{ii}^{1/2}$, where $w_{ii}$ is given in (2.4.12) with $\phi = 0.791$. The model ACF, $\rho(i) = (0.791)^i$, is also plotted in Figure 2-2. Notice that the model ACF just touches the confidence bounds at lags 2-4. This suggests some incompatibility of the data with the model (2.4.13). A much better fit to the residuals is provided by the second-order autoregression defined by (1.4.4).

□ 
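
Bartlett's formula in the form (2.4.10) is easy to evaluate numerically, which gives a quick check on the special cases worked out in the three preceding examples. The sketch below is an illustration under our own assumptions: all function names are hypothetical, the infinite sum is truncated at `kmax` terms, and the ACF is passed in as a function of the lag.

```python
import math

def bartlett_w(rho, i, j, kmax=2000):
    """w_ij from (2.4.10), with the infinite sum truncated after kmax terms."""
    def r(h):
        return rho(abs(h))          # rho is symmetric in the lag
    total = 0.0
    for k in range(1, kmax + 1):
        fi = r(k + i) + r(k - i) - 2.0 * r(i) * r(k)
        fj = r(k + j) + r(k - j) - 2.0 * r(j) * r(k)
        total += fi * fj
    return total

def rho_iid(h):
    """ACF of iid noise."""
    return 1.0 if h == 0 else 0.0

def rho_ma1(theta):
    """ACF of the MA(1) process X_t = Z_t + theta Z_{t-1}."""
    def rho(h):
        if h == 0:
            return 1.0
        return theta / (1.0 + theta ** 2) if h == 1 else 0.0
    return rho

def rho_ar1(phi):
    """ACF of a causal AR(1) process."""
    return lambda h: phi ** h

def ar1_w_closed_form(phi, i):
    """Closed form (2.4.12) for w_ii of an AR(1) process."""
    return (1 - phi ** (2 * i)) * (1 + phi ** 2) / (1 - phi ** 2) - 2 * i * phi ** (2 * i)

def conf_halfwidth(w, n, z=1.96):
    """Half-width z * n^{-1/2} * w^{1/2} of the approximate bounds."""
    return z * math.sqrt(w / n)
```

For instance, `conf_halfwidth(1 - 3*r**2 + 4*r**4, 200)` with `r = -0.4333` reproduces the half-width 0.1053 quoted in Example 2.4.3, and `bartlett_w(rho_ar1(0.791), i, i)` agrees with `ar1_w_closed_form(0.791, i)` up to truncation error.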


[Figure 2-2. The sample autocorrelation function of the Lake Huron residuals of Figure 1-10, showing the bounds $\hat\rho(i) \pm 1.96\,n^{-1/2}w_{ii}^{1/2}$ and the model ACF $\rho(i) = (0.791)^i$.]

2.5  Forecasting  Stationary  Time  Series 

We now consider the problem of predicting the values $X_{n+h}$, $h > 0$, of a stationary time series with known mean $\mu$ and autocovariance function $\gamma$ in terms of the values $\{X_n, \ldots, X_1\}$, up to time $n$. Our goal is to find the linear combination of $1, X_n, X_{n-1}, \ldots, X_1$ that forecasts $X_{n+h}$ with minimum mean squared error. The best linear predictor in terms of $1, X_n, \ldots, X_1$ will be denoted by $P_nX_{n+h}$ and clearly has the form

$$P_nX_{n+h} = a_0 + a_1X_n + \cdots + a_nX_1. \tag{2.5.1}$$

It remains only to determine the coefficients $a_0, a_1, \ldots, a_n$, by finding the values that minimize

$$S(a_0, \ldots, a_n) = E\left(X_{n+h} - a_0 - a_1X_n - \cdots - a_nX_1\right)^2. \tag{2.5.2}$$

(We already know from Problem 1.1 that $P_0Y = E(Y)$.) Since $S$ is a quadratic function of $a_0, \ldots, a_n$ and is bounded below by zero, it is clear that there is at least one value of $(a_0, \ldots, a_n)$ that minimizes $S$ and that the minimum $(a_0, \ldots, a_n)$ satisfies the equations

$$\frac{\partial S(a_0, \ldots, a_n)}{\partial a_j} = 0, \quad j = 0, \ldots, n. \tag{2.5.3}$$

Evaluation of the derivatives in equation (2.5.3) gives the equivalent equations

$$E\left[X_{n+h} - a_0 - \sum_{i=1}^{n}a_iX_{n+1-i}\right] = 0, \tag{2.5.4}$$

$$E\left[\left(X_{n+h} - a_0 - \sum_{i=1}^{n}a_iX_{n+1-i}\right)X_{n+1-j}\right] = 0, \quad j = 1, \ldots, n. \tag{2.5.5}$$

These equations can be written more neatly in vector notation as

$$a_0 = \mu\left(1 - \sum_{i=1}^{n}a_i\right) \tag{2.5.6}$$

and

$$\Gamma_n\mathbf{a}_n = \boldsymbol{\gamma}_n(h), \tag{2.5.7}$$

where

$$\mathbf{a}_n = (a_1, \ldots, a_n)', \quad \Gamma_n = [\gamma(i - j)]_{i,j=1}^{n},$$

and

$$\boldsymbol{\gamma}_n(h) = (\gamma(h), \gamma(h+1), \ldots, \gamma(h+n-1))'.$$

Hence,

$$P_nX_{n+h} = \mu + \sum_{i=1}^{n}a_i\left(X_{n+1-i} - \mu\right), \tag{2.5.8}$$

where $\mathbf{a}_n$ satisfies (2.5.7). From (2.5.8) the expected value of the prediction error $X_{n+h} - P_nX_{n+h}$ is zero, and the mean square prediction error is therefore

$$E\left(X_{n+h} - P_nX_{n+h}\right)^2 = \gamma(0) - 2\sum_{i=1}^{n}a_i\gamma(h+i-1) + \sum_{i=1}^{n}\sum_{j=1}^{n}a_i\gamma(i-j)a_j = \gamma(0) - \mathbf{a}_n'\boldsymbol{\gamma}_n(h), \tag{2.5.9}$$

where the last line follows from (2.5.7).


Remark 1. To show that equations (2.5.4) and (2.5.5) determine $P_nX_{n+h}$ uniquely, let $\{a_j^{(1)}, j = 0, \ldots, n\}$ and $\{a_j^{(2)}, j = 0, \ldots, n\}$ be two solutions and let $Z$ be the difference between the corresponding predictors, i.e.,

$$Z = a_0^{(1)} - a_0^{(2)} + \sum_{j=1}^{n}\left(a_j^{(1)} - a_j^{(2)}\right)X_{n+1-j}.$$

Then

$$Z^2 = Z\left(a_0^{(1)} - a_0^{(2)} + \sum_{j=1}^{n}\left(a_j^{(1)} - a_j^{(2)}\right)X_{n+1-j}\right).$$

But from (2.5.4) and (2.5.5) we have $EZ = 0$ and $E(ZX_{n+1-j}) = 0$ for $j = 1, \ldots, n$. Consequently, $E(Z^2) = 0$ and hence $Z = 0$. □
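
Equations (2.5.6) to (2.5.9) translate directly into a small computation: build $\Gamma_n$ and $\boldsymbol{\gamma}_n(h)$ from the autocovariance function, solve the linear system, and read off the coefficients and the mean squared error. The following sketch uses only the Python standard library; the function names are our own, and the elimination routine assumes that $\Gamma_n$ is nonsingular.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (A assumed nonsingular)."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]   # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def best_linear_predictor(gamma, n, h, mu=0.0):
    """Return (a0, a, mse) for P_n X_{n+h}, where gamma(lag) is the ACVF.

    a[i-1] is the coefficient a_i of X_{n+1-i}, as in (2.5.1).
    """
    Gamma = [[gamma(abs(i - j)) for j in range(n)] for i in range(n)]
    g = [gamma(h + i) for i in range(n)]               # gamma_n(h), see (2.5.7)
    a = solve_linear(Gamma, g)
    a0 = mu * (1.0 - sum(a))                           # (2.5.6)
    mse = gamma(0) - sum(ai * gi for ai, gi in zip(a, g))   # (2.5.9)
    return a0, a, mse

# Illustrative check with an AR(1) ACVF, phi = 0.6 and sigma^2 = 1 (values chosen by us)
phi = 0.6
acvf = lambda lag: phi ** lag / (1.0 - phi ** 2)
a0, a, mse = best_linear_predictor(acvf, n=4, h=2, mu=10.0)
```

For this AR(1) autocovariance function the solution is $\mathbf{a}_n = (\phi^h, 0, \ldots, 0)'$, so only the most recent observation carries weight.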


Properties of $P_nX_{n+h}$:

1. $P_nX_{n+h} = \mu + \sum_{i=1}^{n}a_i(X_{n+1-i} - \mu)$, where $\mathbf{a}_n = (a_1, \ldots, a_n)'$ satisfies (2.5.7).

2. $E(X_{n+h} - P_nX_{n+h})^2 = \gamma(0) - \mathbf{a}_n'\boldsymbol{\gamma}_n(h)$, where $\boldsymbol{\gamma}_n(h) = (\gamma(h), \ldots, \gamma(h+n-1))'$.

3. $E(X_{n+h} - P_nX_{n+h}) = 0$.

4. $E[(X_{n+h} - P_nX_{n+h})X_j] = 0$, $j = 1, \ldots, n$.


Remark 2. Notice that properties 3 and 4 are exactly equivalent to (2.5.4) and (2.5.5). They can be written more succinctly in the form

$$E[(\text{Error}) \times (\text{PredictorVariable})] = 0. \tag{2.5.10}$$

The equations (2.5.10), one for each predictor variable, therefore uniquely determine $P_nX_{n+h}$. □


Example 2.5.1 One-Step Prediction of an AR(1) Series

Consider now the stationary time series defined in Example 2.2.1 by the equations

$$X_t = \phi X_{t-1} + Z_t, \quad t = 0, \pm 1, \ldots,$$


where $|\phi| < 1$ and $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$. From (2.5.7) and (2.5.8), the best linear predictor of $X_{n+1}$ in terms of $\{1, X_n, \ldots, X_1\}$ is (for $n \ge 1$)

$$P_nX_{n+1} = \mathbf{a}_n'\mathbf{X}_n,$$

where $\mathbf{X}_n = (X_n, \ldots, X_1)'$ and

$$\begin{bmatrix} 1 & \phi & \phi^2 & \cdots & \phi^{n-1} \\ \phi & 1 & \phi & \cdots & \phi^{n-2} \\ \vdots & \vdots & \vdots & & \vdots \\ \phi^{n-1} & \phi^{n-2} & \phi^{n-3} & \cdots & 1 \end{bmatrix}\begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix} = \begin{bmatrix} \phi \\ \phi^2 \\ \vdots \\ \phi^n \end{bmatrix}. \tag{2.5.11}$$

A solution of (2.5.11) is clearly

$$\mathbf{a}_n = (\phi, 0, \ldots, 0)',$$

and hence the best linear predictor of $X_{n+1}$ in terms of $\{X_1, \ldots, X_n\}$ is

$$P_nX_{n+1} = \mathbf{a}_n'\mathbf{X}_n = \phi X_n,$$

with mean squared error

$$E(X_{n+1} - P_nX_{n+1})^2 = \gamma(0) - \mathbf{a}_n'\boldsymbol{\gamma}_n(1) = \frac{\sigma^2}{1 - \phi^2} - \phi\gamma(1) = \sigma^2.$$

A simpler approach to this problem is to guess, by inspection of the equation defining $X_{n+1}$, that the best predictor is $\phi X_n$. Then to verify this conjecture, it suffices to check (2.5.10) for each of the predictor variables $1, X_n, \ldots, X_1$. The prediction error of the predictor $\phi X_n$ is clearly $X_{n+1} - \phi X_n = Z_{n+1}$. But $E(Z_{n+1}Y) = 0$ for $Y = 1$ and for $Y = X_j$, $j = 1, \ldots, n$. Hence, by (2.5.10), $\phi X_n$ is the required best linear predictor in terms of $1, X_1, \ldots, X_n$.

□ 


2.5.1  Prediction  of  Second-Order  Random  Variables 

Suppose now that $Y$ and $W_n, \ldots, W_1$ are any random variables with finite second moments and that the means $\mu_Y = EY$, $\mu_i = EW_i$ and covariances $\operatorname{Cov}(Y, Y)$, $\operatorname{Cov}(Y, W_i)$, and $\operatorname{Cov}(W_i, W_j)$ are all known. It is convenient to introduce the random vector $\mathbf{W} = (W_n, \ldots, W_1)'$, the corresponding vector of means $\boldsymbol{\mu}_W = (\mu_n, \ldots, \mu_1)'$, the vector of covariances

$$\boldsymbol{\gamma} = \operatorname{Cov}(Y, \mathbf{W}) = (\operatorname{Cov}(Y, W_n), \operatorname{Cov}(Y, W_{n-1}), \ldots, \operatorname{Cov}(Y, W_1))',$$

and the covariance matrix

$$\Gamma = \operatorname{Cov}(\mathbf{W}, \mathbf{W}) = \left[\operatorname{Cov}(W_{n+1-i}, W_{n+1-j})\right]_{i,j=1}^{n}.$$

Then by the same arguments used in the calculation of $P_nX_{n+h}$, the best linear predictor of $Y$ in terms of $\{1, W_n, \ldots, W_1\}$ is found to be

$$P(Y \mid \mathbf{W}) = \mu_Y + \mathbf{a}'(\mathbf{W} - \boldsymbol{\mu}_W), \tag{2.5.12}$$

where $\mathbf{a} = (a_1, \ldots, a_n)'$ is any solution of

$$\Gamma\mathbf{a} = \boldsymbol{\gamma}. \tag{2.5.13}$$

The mean squared error of the predictor is

$$E\left[(Y - P(Y \mid \mathbf{W}))^2\right] = \operatorname{Var}(Y) - \mathbf{a}'\boldsymbol{\gamma}. \tag{2.5.14}$$


Example 2.5.2 Estimation of a Missing Value

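Consider again the stationary series defined in Example 2.2.1 by the equations

$$X_t = \phi X_{t-1} + Z_t, \quad t = 0, \pm 1, \ldots,$$

where $|\phi| < 1$ and $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$. Suppose that we observe the series at times 1 and 3 and wish to use these observations to find the linear combination of $1$, $X_1$, and $X_3$ that estimates $X_2$ with minimum mean squared error. The solution to this problem can be obtained directly from (2.5.12) and (2.5.13) by setting $Y = X_2$ and $\mathbf{W} = (X_1, X_3)'$. This gives the equations

$$\begin{bmatrix} 1 & \phi^2 \\ \phi^2 & 1 \end{bmatrix}\mathbf{a} = \begin{bmatrix} \phi \\ \phi \end{bmatrix},$$

with solution

$$\mathbf{a} = \frac{1}{1 + \phi^2}\begin{bmatrix} \phi \\ \phi \end{bmatrix}.$$

The best estimator of $X_2$ is thus

$$P(X_2 \mid \mathbf{W}) = \frac{\phi}{1 + \phi^2}(X_1 + X_3),$$

with mean squared error

$$E\left[(X_2 - P(X_2 \mid \mathbf{W}))^2\right] = \frac{\sigma^2}{1 - \phi^2} - \mathbf{a}'\begin{bmatrix} \dfrac{\phi\sigma^2}{1 - \phi^2} \\[2mm] \dfrac{\phi\sigma^2}{1 - \phi^2} \end{bmatrix} = \frac{\sigma^2}{1 + \phi^2}.$$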

□ 
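
The missing-value calculation can be checked numerically by solving the $2 \times 2$ system (2.5.13) explicitly. The sketch below is ours: the function names are hypothetical, and the system is solved by Cramer's rule simply because it has only two unknowns.

```python
def missing_value_weights(phi, sigma2=1.0):
    """Solve Gamma a = gamma for Y = X_2, W = (X_1, X_3)' in a causal AR(1) model.

    Returns ((a1, a2), mse), with mse computed from (2.5.14).
    """
    g0 = sigma2 / (1.0 - phi ** 2)     # gamma(0)
    g1 = phi * g0                      # gamma(1) = Cov(X_2, X_1) = Cov(X_2, X_3)
    g2 = phi ** 2 * g0                 # gamma(2) = Cov(X_1, X_3)
    det = g0 * g0 - g2 * g2
    a1 = (g1 * g0 - g2 * g1) / det     # Cramer's rule, first unknown
    a2 = (g0 * g1 - g2 * g1) / det     # Cramer's rule, second unknown
    mse = g0 - (a1 * g1 + a2 * g1)
    return (a1, a2), mse

def estimate_x2(x1, x3, phi):
    """Best linear estimate of the missing X_2 for a zero-mean AR(1) series."""
    (a1, a2), _ = missing_value_weights(phi)
    return a1 * x1 + a2 * x3
```

With $\phi = 0.5$ and $\sigma^2 = 1$ both weights equal $\phi/(1 + \phi^2) = 0.4$ and the mean squared error is $\sigma^2/(1 + \phi^2) = 0.8$, which is smaller than the one-step prediction error $\sigma^2 = 1$ because the value on each side of the gap is used.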


2.5.2 The Prediction Operator $P(\cdot \mid \mathbf{W})$

For any given $\mathbf{W} = (W_n, \ldots, W_1)'$ and $Y$ with finite second moments, we have seen how to compute the best linear predictor $P(Y \mid \mathbf{W})$ of $Y$ in terms of $1, W_n, \ldots, W_1$ from (2.5.12) and (2.5.13). The function $P(\cdot \mid \mathbf{W})$, which converts $Y$ into $P(Y \mid \mathbf{W})$, is called a prediction operator. (The operator $P_n$ defined by equations (2.5.7) and (2.5.8) is an example with $\mathbf{W} = (X_n, X_{n-1}, \ldots, X_1)'$.) Prediction operators have a number of useful properties that can sometimes be used to simplify the calculation of best linear predictors. We list some of these below.


Properties of the Prediction Operator $P(\cdot \mid \mathbf{W})$:

Suppose that $EU^2 < \infty$, $EV^2 < \infty$, $\Gamma = \operatorname{Cov}(\mathbf{W}, \mathbf{W})$, and $\beta, \alpha_1, \ldots, \alpha_n$ are constants.

1. $P(U \mid \mathbf{W}) = EU + \mathbf{a}'(\mathbf{W} - E\mathbf{W})$, where $\Gamma\mathbf{a} = \operatorname{Cov}(U, \mathbf{W})$.

2. $E[(U - P(U \mid \mathbf{W}))\mathbf{W}] = \mathbf{0}$ and $E[U - P(U \mid \mathbf{W})] = 0$.

3. $E[(U - P(U \mid \mathbf{W}))^2] = \operatorname{Var}(U) - \mathbf{a}'\operatorname{Cov}(U, \mathbf{W})$.

4. $P(\alpha_1U + \alpha_2V + \beta \mid \mathbf{W}) = \alpha_1P(U \mid \mathbf{W}) + \alpha_2P(V \mid \mathbf{W}) + \beta$.

5. $P\left(\sum_{i=1}^{n}\alpha_iW_i + \beta \mid \mathbf{W}\right) = \sum_{i=1}^{n}\alpha_iW_i + \beta$.

6. $P(U \mid \mathbf{W}) = EU$ if $\operatorname{Cov}(U, \mathbf{W}) = \mathbf{0}$.

7. $P(U \mid \mathbf{W}) = P(P(U \mid \mathbf{W}, \mathbf{V}) \mid \mathbf{W})$ if $\mathbf{V}$ is a random vector such that the components of $E(\mathbf{V}\mathbf{V}')$ are all finite.


Example 2.5.3 One-Step Prediction of an AR(p) Series

Suppose now that $\{X_t\}$ is a stationary time series satisfying the equations

$$X_t = \phi_1X_{t-1} + \cdots + \phi_pX_{t-p} + Z_t, \quad t = 0, \pm 1, \ldots,$$

where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$ and $Z_t$ is uncorrelated with $X_s$ for each $s < t$. Then if $n \ge p$, we can apply the prediction operator $P_n$ to each side of the defining equations, using properties (4), (5), and (6) to get

$$P_nX_{n+1} = \phi_1X_n + \cdots + \phi_pX_{n+1-p}.$$

□


Example 2.5.4 An AR(1) Series with Nonzero Mean

The time series $\{Y_t\}$ is said to be an AR(1) process with mean $\mu$ if $\{X_t = Y_t - \mu\}$ is a zero-mean AR(1) process. Defining $\{X_t\}$ as in Example 2.5.1 and letting $Y_t = X_t + \mu$, we see that $Y_t$ satisfies the equation

$$Y_t - \mu = \phi(Y_{t-1} - \mu) + Z_t. \tag{2.5.15}$$

If $P_nY_{n+h}$ is the best linear predictor of $Y_{n+h}$ in terms of $\{1, Y_n, \ldots, Y_1\}$, then application of $P_n$ to (2.5.15) with $t = n + 1, n + 2, \ldots$ gives the recursions

$$P_nY_{n+h} - \mu = \phi\left(P_nY_{n+h-1} - \mu\right), \quad h = 1, 2, \ldots.$$

Noting that $P_nY_n = Y_n$, we can solve these equations recursively for $P_nY_{n+h}$, $h = 1, 2, \ldots$, to obtain

$$P_nY_{n+h} = \mu + \phi^h(Y_n - \mu). \tag{2.5.16}$$

The corresponding mean squared error is [from (2.5.14)]

$$E(Y_{n+h} - P_nY_{n+h})^2 = \gamma(0)\left[1 - \mathbf{a}_n'\boldsymbol{\rho}_n(h)\right]. \tag{2.5.17}$$

From Example 2.2.1 we know that $\gamma(0) = \sigma^2/(1 - \phi^2)$ and $\rho(h) = \phi^h$, $h \ge 0$. Hence, substituting $\mathbf{a}_n = (\phi^h, 0, \ldots, 0)'$ [from (2.5.16)] into (2.5.17) gives

$$E(Y_{n+h} - P_nY_{n+h})^2 = \sigma^2\left(1 - \phi^{2h}\right)/\left(1 - \phi^2\right). \tag{2.5.18}$$

□

Remark 3. In general, if $\{Y_t\}$ is a stationary time series with mean $\mu$ and if $\{X_t\}$ is the zero-mean series defined by $X_t = Y_t - \mu$, then since the collection of all linear combinations of $1, Y_n, \ldots, Y_1$ is the same as the collection of all linear combinations of $1, X_n, \ldots, X_1$, the linear predictor of any random variable $W$ in terms of $1, Y_n, \ldots, Y_1$ is the same as the linear predictor in terms of $1, X_n, \ldots, X_1$. Denoting this predictor by $P_nW$ and applying $P_n$ to the equation $Y_{n+h} = X_{n+h} + \mu$ gives

$$P_nY_{n+h} = \mu + P_nX_{n+h}. \tag{2.5.19}$$

Thus the best linear predictor of $Y_{n+h}$ can be determined by finding the best linear predictor of $X_{n+h}$ and then adding $\mu$. Note from (2.5.8) that since $E(X_t) = 0$, $P_nX_{n+h}$ is the same as the best linear predictor of $X_{n+h}$ in terms of $X_n, \ldots, X_1$ only. □
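
For an AR(1) series with mean $\mu$, formulas (2.5.16) and (2.5.18) give the $h$-step forecast and its mean squared error in closed form, and the recursion that precedes (2.5.16) gives the same forecast step by step. A minimal sketch follows; the function names and the numerical values used afterwards are our own illustrative choices.

```python
def ar1_forecast(y_n, mu, phi, sigma2, h):
    """h-step forecast (2.5.16) and mean squared error (2.5.18) for an AR(1) with mean mu."""
    pred = mu + phi ** h * (y_n - mu)
    mse = sigma2 * (1.0 - phi ** (2 * h)) / (1.0 - phi ** 2)
    return pred, mse

def ar1_forecast_recursive(y_n, mu, phi, h):
    """Same forecast from the recursion P_n Y_{n+k} - mu = phi (P_n Y_{n+k-1} - mu)."""
    pred = y_n                       # P_n Y_n = Y_n
    for _ in range(h):
        pred = mu + phi * (pred - mu)
    return pred

# Illustrative values: mean 10, phi = 0.5, last observation 12, three steps ahead
pred, mse = ar1_forecast(12.0, 10.0, 0.5, 1.0, 3)
```

As $h$ grows, the forecast decays geometrically toward $\mu$ and the mean squared error increases toward $\gamma(0) = \sigma^2/(1 - \phi^2)$.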


2.5.3  The  Durbin-Levinson  Algorithm 


In view of Remark 3 above, we can restrict attention from now on to zero-mean stationary time series, making the necessary adjustments for the mean if we wish to predict a stationary series with nonzero mean. If $\{X_t\}$ is a zero-mean stationary series with autocovariance function $\gamma(\cdot)$, then in principle the equations (2.5.12) and (2.5.13) completely solve the problem of determining the best linear predictor $P_nX_{n+h}$ of $X_{n+h}$ in terms of $\{X_n, \ldots, X_1\}$. However, the direct approach requires the determination of a solution of a system of $n$ linear equations, which for large $n$ may be difficult and time-consuming. In cases where the process is defined by a system of linear equations (as in Examples 2.5.2 and 2.5.3) we have seen how the linearity of $P_n$ can be used to great advantage. For more general stationary processes it would be helpful if the one-step predictor $P_nX_{n+1}$ based on $n$ previous observations could be used to simplify the calculation of $P_{n+1}X_{n+2}$, the one-step predictor based on $n + 1$ previous observations. Prediction algorithms that utilize this idea are said to be recursive. Two important examples are the Durbin-Levinson algorithm, discussed in this section, and the innovations algorithm, discussed in Section 2.5.4 below.

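We know from (2.5.12) and (2.5.13) that if the matrix $\Gamma_n$ is nonsingular, then

$$P_nX_{n+1} = \boldsymbol{\phi}_n'\mathbf{X}_n = \phi_{n1}X_n + \cdots + \phi_{nn}X_1,$$

where

$$\boldsymbol{\phi}_n = \Gamma_n^{-1}\boldsymbol{\gamma}_n,$$

$\boldsymbol{\gamma}_n = (\gamma(1), \ldots, \gamma(n))'$, and the corresponding mean squared error is

$$v_n := E(X_{n+1} - P_nX_{n+1})^2 = \gamma(0) - \boldsymbol{\phi}_n'\boldsymbol{\gamma}_n.$$

A useful sufficient condition for nonsingularity of all the autocovariance matrices $\Gamma_1, \Gamma_2, \ldots$ is $\gamma(0) > 0$ and $\gamma(h) \to 0$ as $h \to \infty$. (For a proof of this result see Brockwell and Davis (1991), Proposition 5.1.1.)

The Durbin-Levinson Algorithm:

The coefficients $\phi_{n1}, \ldots, \phi_{nn}$ can be computed recursively from the equations

$$\phi_{nn} = \left[\gamma(n) - \sum_{j=1}^{n-1}\phi_{n-1,j}\gamma(n-j)\right]v_{n-1}^{-1}, \tag{2.5.20}$$

$$\begin{bmatrix} \phi_{n1} \\ \vdots \\ \phi_{n,n-1} \end{bmatrix} = \begin{bmatrix} \phi_{n-1,1} \\ \vdots \\ \phi_{n-1,n-1} \end{bmatrix} - \phi_{nn}\begin{bmatrix} \phi_{n-1,n-1} \\ \vdots \\ \phi_{n-1,1} \end{bmatrix} \tag{2.5.21}$$

and

$$v_n = v_{n-1}\left[1 - \phi_{nn}^2\right], \tag{2.5.22}$$

where $\phi_{11} = \gamma(1)/\gamma(0)$ and $v_0 = \gamma(0)$.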


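Proof. The definition of $\phi_{11}$ ensures that the equation

$$R_n\boldsymbol{\phi}_n = \boldsymbol{\rho}_n \tag{2.5.23}$$

(where $\boldsymbol{\rho}_n = (\rho(1), \ldots, \rho(n))'$) is satisfied for $n = 1$. The first step in the proof is to show that $\boldsymbol{\phi}_n$, defined recursively by (2.5.20) and (2.5.21), satisfies (2.5.23) for all $n$. Suppose this is true for $n = k$. Then, partitioning $R_{k+1}$ and defining

$$\boldsymbol{\rho}_k^{(r)} := (\rho(k), \rho(k-1), \ldots, \rho(1))'$$

and

$$\boldsymbol{\phi}_k^{(r)} := (\phi_{kk}, \phi_{k,k-1}, \ldots, \phi_{k1})',$$

we see that the recursions imply

$$R_{k+1}\boldsymbol{\phi}_{k+1} = \begin{bmatrix} R_k & \boldsymbol{\rho}_k^{(r)} \\ \boldsymbol{\rho}_k^{(r)\prime} & 1 \end{bmatrix}\begin{bmatrix} \boldsymbol{\phi}_k - \phi_{k+1,k+1}\boldsymbol{\phi}_k^{(r)} \\ \phi_{k+1,k+1} \end{bmatrix} = \begin{bmatrix} \boldsymbol{\rho}_k - \phi_{k+1,k+1}\boldsymbol{\rho}_k^{(r)} + \phi_{k+1,k+1}\boldsymbol{\rho}_k^{(r)} \\ \boldsymbol{\rho}_k^{(r)\prime}\boldsymbol{\phi}_k - \phi_{k+1,k+1}\boldsymbol{\rho}_k^{(r)\prime}\boldsymbol{\phi}_k^{(r)} + \phi_{k+1,k+1} \end{bmatrix} = \boldsymbol{\rho}_{k+1},$$

as required. Here we have used the fact that if $R_k\boldsymbol{\phi}_k = \boldsymbol{\rho}_k$, then $R_k\boldsymbol{\phi}_k^{(r)} = \boldsymbol{\rho}_k^{(r)}$. This is easily checked by writing out the component equations in reverse order. Since (2.5.23) is satisfied for $n = 1$, it follows by induction that the coefficient vectors $\boldsymbol{\phi}_n$ defined recursively by (2.5.20) and (2.5.21) satisfy (2.5.23) for all $n$.

It remains only to establish that the mean squared errors

$$v_n := E(X_{n+1} - \boldsymbol{\phi}_n'\mathbf{X}_n)^2$$

satisfy $v_0 = \gamma(0)$ and (2.5.22). The fact that $v_0 = \gamma(0)$ is an immediate consequence of the definition $P_0X_1 := E(X_1) = 0$. Since we have shown that $\boldsymbol{\phi}_n'\mathbf{X}_n$ is the best linear predictor of $X_{n+1}$, we can write, from (2.5.9) and (2.5.21),

$$v_n = \gamma(0) - \boldsymbol{\phi}_n'\boldsymbol{\gamma}_n = \gamma(0) - \boldsymbol{\phi}_{n-1}'\boldsymbol{\gamma}_{n-1} + \phi_{nn}\boldsymbol{\phi}_{n-1}^{(r)\prime}\boldsymbol{\gamma}_{n-1} - \phi_{nn}\gamma(n).$$

Applying (2.5.9) again gives

$$v_n = v_{n-1} + \phi_{nn}\left(\boldsymbol{\phi}_{n-1}^{(r)\prime}\boldsymbol{\gamma}_{n-1} - \gamma(n)\right),$$

and hence, by (2.5.20),

$$v_n = v_{n-1} - \phi_{nn}^2\left(\gamma(0) - \boldsymbol{\phi}_{n-1}'\boldsymbol{\gamma}_{n-1}\right) = v_{n-1}\left(1 - \phi_{nn}^2\right).$$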

Remark 4. Under the conditions of the proposition, the function defined by $\alpha(0) = 1$ and $\alpha(n) = \phi_{nn}$, $n = 1, 2, \ldots$, is known as the partial autocorrelation function (PACF) of $\{X_t\}$, discussed further in Section 3.2. Equation (2.5.22) shows the relation between $\alpha(n)$ and the reduction in the one-step mean squared error as the number of predictors is increased from $n - 1$ to $n$. □


2.5.4  The  Innovations  Algorithm 

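The recursive algorithm to be discussed in this section is applicable to all series with finite second moments, regardless of whether they are stationary or not. Its application, however, can be simplified in certain special cases.

Suppose then that $\{X_t\}$ is a zero-mean series with $E|X_t|^2 < \infty$ for each $t$ and

$$E(X_iX_j) = \kappa(i, j). \tag{2.5.24}$$

We denote the best one-step linear predictors and their mean squared errors by

$$\hat X_n = \begin{cases} 0, & \text{if } n = 1, \\ P_{n-1}X_n, & \text{if } n = 2, 3, \ldots, \end{cases}$$

and

$$v_n = E(X_{n+1} - P_nX_{n+1})^2.$$

We shall also introduce the innovations, or one-step prediction errors,

$$U_n = X_n - \hat X_n.$$

In terms of the vectors $\mathbf{U}_n = (U_1, \ldots, U_n)'$ and $\mathbf{X}_n = (X_1, \ldots, X_n)'$ the last equations can be written as

$$\mathbf{U}_n = A_n\mathbf{X}_n, \tag{2.5.25}$$

where $A_n$ has the form

$$A_n = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ a_{11} & 1 & 0 & \cdots & 0 \\ a_{22} & a_{21} & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ a_{n-1,n-1} & a_{n-1,n-2} & a_{n-1,n-3} & \cdots & 1 \end{bmatrix}.$$

(If $\{X_t\}$ is stationary, then $a_{ij} = -a_j$ with $a_j$ as in (2.5.7) with $h = 1$.) This implies that $A_n$ is nonsingular, with inverse $C_n$ of the form

$$C_n = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ \theta_{11} & 1 & 0 & \cdots & 0 \\ \theta_{22} & \theta_{21} & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \theta_{n-1,n-1} & \theta_{n-1,n-2} & \theta_{n-1,n-3} & \cdots & 1 \end{bmatrix}.$$

The vector of one-step predictors $\hat{\mathbf{X}}_n := (\hat X_1, P_1X_2, \ldots, P_{n-1}X_n)'$ can therefore be expressed as

$$\hat{\mathbf{X}}_n = \mathbf{X}_n - \mathbf{U}_n = C_n\mathbf{U}_n - \mathbf{U}_n = \Theta_n\left(\mathbf{X}_n - \hat{\mathbf{X}}_n\right), \tag{2.5.26}$$

where

$$\Theta_n = \begin{bmatrix} 0 & 0 & 0 & \cdots & 0 \\ \theta_{11} & 0 & 0 & \cdots & 0 \\ \theta_{22} & \theta_{21} & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \theta_{n-1,n-1} & \theta_{n-1,n-2} & \theta_{n-1,n-3} & \cdots & 0 \end{bmatrix}$$

and $\mathbf{X}_n$ itself satisfies

$$\mathbf{X}_n = C_n\left(\mathbf{X}_n - \hat{\mathbf{X}}_n\right). \tag{2.5.27}$$

Equation (2.5.26) can be rewritten as

$$\hat X_{n+1} = \begin{cases} 0, & \text{if } n = 0, \\ \sum_{j=1}^{n}\theta_{nj}\left(X_{n+1-j} - \hat X_{n+1-j}\right), & \text{if } n = 1, 2, \ldots, \end{cases} \tag{2.5.28}$$

from which the one-step predictors $\hat X_1, \hat X_2, \ldots$ can be computed recursively once the coefficients $\theta_{ij}$ have been determined. The following algorithm generates these coefficients and the mean squared errors $v_i = E(X_{i+1} - \hat X_{i+1})^2$, starting from the covariances $\kappa(i, j)$.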


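The Innovations Algorithm:

The coefficients $\theta_{n1}, \ldots, \theta_{nn}$ can be computed recursively from the equations

$$v_0 = \kappa(1, 1),$$

$$\theta_{n,n-k} = v_k^{-1}\left(\kappa(n+1, k+1) - \sum_{j=0}^{k-1}\theta_{k,k-j}\theta_{n,n-j}v_j\right), \quad 0 \le k < n,$$

and

$$v_n = \kappa(n+1, n+1) - \sum_{j=0}^{n-1}\theta_{n,n-j}^2v_j.$$

(It is a trivial matter to solve first for $v_0$, then successively for $\theta_{11}, v_1$; $\theta_{22}, \theta_{21}, v_2$; $\theta_{33}, \theta_{32}, \theta_{31}, v_3$; $\ldots$.)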


Proof  See  Brockwell  and  Davis  (1991),  Proposition  5.2.2. 
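
The order of computation described in the parenthetical remark above ($v_0$; then $\theta_{11}, v_1$; then $\theta_{22}, \theta_{21}, v_2$; and so on) is exactly the loop structure of the following sketch. The function name and the storage convention (`theta[n][j]` holds $\theta_{nj}$, with `theta[n][0]` set to 1) are our own assumptions, and the code presumes that every $v_k$ is nonzero.

```python
def innovations(kappa, nmax):
    """Innovations algorithm for a zero-mean series with E(X_i X_j) = kappa(i, j).

    kappa: function of two 1-based time indices.
    Returns (theta, v): theta[n][j] = theta_{nj} for 1 <= j <= n (theta[n][0] = 1
    by convention) and v[n] = v_n, for n = 0, ..., nmax.
    """
    v = [kappa(1, 1)]
    theta = [[1.0]]
    for n in range(1, nmax + 1):
        row = [1.0] + [0.0] * n
        for k in range(n):
            # theta_{n,n-k} uses theta_{n,n-j}, j < k, which were filled in earlier
            s = sum(theta[k][k - j] * row[n - j] * v[j] for j in range(k))
            row[n - k] = (kappa(n + 1, k + 1) - s) / v[k]
        theta.append(row)
        v.append(kappa(n + 1, n + 1)
                 - sum(row[n - j] ** 2 * v[j] for j in range(n)))
    return theta, v

def ma1_kappa(theta1, sigma2):
    """Covariance function of the MA(1) process X_t = Z_t + theta1 Z_{t-1}."""
    def kappa(i, j):
        if i == j:
            return sigma2 * (1.0 + theta1 ** 2)
        return theta1 * sigma2 if abs(i - j) == 1 else 0.0
    return kappa

theta, v = innovations(ma1_kappa(-0.9, 1.0), 4)
```

Because `kappa` takes both time indices, the same routine applies unchanged to nonstationary covariance structures, which is the main practical difference from the Durbin-Levinson recursions.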


Remark 5. While the Durbin-Levinson recursion gives the coefficients of $X_n, \ldots, X_1$ in the representation $\hat X_{n+1} = \sum_{j=1}^{n}\phi_{nj}X_{n+1-j}$, the innovations algorithm gives the coefficients of $(X_n - \hat X_n), \ldots, (X_1 - \hat X_1)$, in the alternative expansion $\hat X_{n+1} = \sum_{j=1}^{n}\theta_{nj}(X_{n+1-j} - \hat X_{n+1-j})$. The latter expansion has a number of advantages deriving from the fact that the innovations are uncorrelated (see Problem 2.20). It can also be greatly simplified in the case of ARMA($p$, $q$) series, as we shall see in Section 3.3. An immediate consequence of (2.5.28) is the innovations representation of $X_{n+1}$ itself. Thus (defining $\theta_{n0} := 1$),

$$X_{n+1} = X_{n+1} - \hat X_{n+1} + \hat X_{n+1} = \sum_{j=0}^{n}\theta_{nj}\left(X_{n+1-j} - \hat X_{n+1-j}\right), \quad n = 0, 1, 2, \ldots.$$

□


Example 2.5.5 Recursive Prediction of an MA(1)

If $\{X_t\}$ is the time series defined by

$$X_t = Z_t + \theta Z_{t-1}, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2),$$

then $\kappa(i, j) = 0$ for $|i - j| > 1$, $\kappa(i, i) = \sigma^2(1 + \theta^2)$, and $\kappa(i, i+1) = \theta\sigma^2$. Application of the innovations algorithm leads at once to the recursions

$$\theta_{nj} = 0, \quad 2 \le j \le n,$$

$$\theta_{n1} = v_{n-1}^{-1}\theta\sigma^2,$$

$$v_0 = (1 + \theta^2)\sigma^2,$$

and

$$v_n = \left[1 + \theta^2 - v_{n-1}^{-1}\theta^2\sigma^2\right]\sigma^2.$$

For the particular case

$$X_t = Z_t - 0.9Z_{t-1}, \quad \{Z_t\} \sim \mathrm{WN}(0, 1),$$

the mean squared errors $v_n$ of $\hat X_{n+1}$ and coefficients $\theta_{nj}$, $1 \le j \le n$, in the innovations representation

$$\hat X_{n+1} = \sum_{j=1}^{n}\theta_{nj}\left(X_{n+1-j} - \hat X_{n+1-j}\right)$$

are found from the recursions to be as follows:

$$\begin{array}{lllll} & & & & v_0 = 1.8100, \\ \theta_{11} = -0.4972, & & & & v_1 = 1.3625, \\ \theta_{21} = -0.6606, & \theta_{22} = 0, & & & v_2 = 1.2155, \\ \theta_{31} = -0.7404, & \theta_{32} = 0, & \theta_{33} = 0, & & v_3 = 1.1436, \\ \theta_{41} = -0.7870, & \theta_{42} = 0, & \theta_{43} = 0, & \theta_{44} = 0, & v_4 = 1.1017. \end{array}$$

If we apply the Durbin-Levinson algorithm to the same problem, we find that the mean squared errors $v_n$ of $\hat X_{n+1}$ and coefficients $\phi_{nj}$, $1 \le j \le n$, in the representation

$$\hat X_{n+1} = \sum_{j=1}^{n}\phi_{nj}X_{n+1-j}$$

are as follows:

$$\begin{array}{lllll} & & & & v_0 = 1.8100, \\ \phi_{11} = -0.4972, & & & & v_1 = 1.3625, \\ \phi_{21} = -0.6606, & \phi_{22} = -0.3285, & & & v_2 = 1.2155, \\ \phi_{31} = -0.7404, & \phi_{32} = -0.4892, & \phi_{33} = -0.2433, & & v_3 = 1.1436, \\ \phi_{41} = -0.7870, & \phi_{42} = -0.5828, & \phi_{43} = -0.3850, & \phi_{44} = -0.1914, & v_4 = 1.1017. \end{array}$$

Notice that as $n$ increases, $v_n$ approaches the white noise variance and $\theta_{n1}$ approaches $\theta$. These results hold for any MA(1) process with $|\theta| < 1$. The innovations algorithm is particularly well suited to forecasting MA($q$) processes, since for them $\theta_{nj} = 0$ for $j > q$. For AR($p$) processes the Durbin-Levinson algorithm is usually more convenient, since $\phi_{nj} = 0$ for $j > p$.

□ 


2.5.5 Recursive Calculation of the h-Step Predictors

For $h$-step prediction we use the result

$$P_n\left(X_{n+k} - P_{n+k-1}X_{n+k}\right) = 0, \quad k \ge 1. \tag{2.5.29}$$

This follows from (2.5.10) and the fact that

$$E\left[\left(X_{n+k} - P_{n+k-1}X_{n+k} - 0\right)X_{n+1-j}\right] = 0, \quad j = 1, \ldots, n.$$

Hence,

$$P_nX_{n+h} = P_nP_{n+h-1}X_{n+h} = P_n\hat X_{n+h} = P_n\left(\sum_{j=1}^{n+h-1}\theta_{n+h-1,j}\left(X_{n+h-j} - \hat X_{n+h-j}\right)\right).$$

Applying (2.5.29) again and using the linearity of $P_n$ we find that

$$P_nX_{n+h} = \sum_{j=h}^{n+h-1}\theta_{n+h-1,j}\left(X_{n+h-j} - \hat X_{n+h-j}\right), \tag{2.5.30}$$

where the coefficients $\theta_{nj}$ are determined as before by the innovations algorithm. Moreover, the mean squared error can be expressed as

$$E(X_{n+h} - P_nX_{n+h})^2 = EX_{n+h}^2 - E(P_nX_{n+h})^2 = \kappa(n+h, n+h) - \sum_{j=h}^{n+h-1}\theta_{n+h-1,j}^2v_{n+h-j-1}. \tag{2.5.31}$$
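
Formulas (2.5.28), (2.5.30), and (2.5.31) combine into a short routine: run the innovations algorithm up to order $n + h - 1$, compute the one-step predictors of the observed values, and then form the $h$-step predictor and its mean squared error. The sketch below is self-contained and uses our own hypothetical function names; the AR(1) covariance function at the end is only an illustrative input with a known answer.

```python
def innovations(kappa, nmax):
    """Innovations algorithm; theta[n][j] = theta_{nj}, v[n] = v_n (kappa is 1-based)."""
    v = [kappa(1, 1)]
    theta = [[1.0]]
    for n in range(1, nmax + 1):
        row = [1.0] + [0.0] * n
        for k in range(n):
            s = sum(theta[k][k - j] * row[n - j] * v[j] for j in range(k))
            row[n - k] = (kappa(n + 1, k + 1) - s) / v[k]
        theta.append(row)
        v.append(kappa(n + 1, n + 1) - sum(row[n - j] ** 2 * v[j] for j in range(n)))
    return theta, v

def one_step_predictors(x, theta):
    """Return xhat with xhat[m] = one-step predictor of X_{m+1}, m = 0, ..., n, from (2.5.28)."""
    xhat = [0.0]
    for m in range(1, len(x) + 1):
        xhat.append(sum(theta[m][j] * (x[m - j] - xhat[m - j])
                        for j in range(1, m + 1)))
    return xhat

def h_step_predictor(x, kappa, h):
    """P_n X_{n+h} from (2.5.30) and its mean squared error from (2.5.31)."""
    n = len(x)
    m = n + h - 1
    theta, v = innovations(kappa, m)
    xhat = one_step_predictors(x, theta)
    # X_{n+h-j} is x[n+h-j-1] and its one-step predictor is xhat[n+h-j-1]
    pred = sum(theta[m][j] * (x[n + h - j - 1] - xhat[n + h - j - 1])
               for j in range(h, m + 1))
    mse = kappa(n + h, n + h) - sum(theta[m][j] ** 2 * v[m - j]
                                    for j in range(h, m + 1))
    return pred, mse

# Illustrative input: zero-mean AR(1) with phi = 0.5 and sigma^2 = 1
ar1_kappa = lambda i, j: 0.5 ** abs(i - j) / (1.0 - 0.25)
pred, mse = h_step_predictor([1.0, -0.5, 2.0], ar1_kappa, 2)
```

For this AR(1) input the result should agree with $\phi^2X_n$ and $\sigma^2(1 - \phi^4)/(1 - \phi^2)$, which gives a simple consistency check on the indexing.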


2.5.6  Prediction  of  a  Stationary  Process  in  Terms  of  Infinitely 
Many  Past  Values 


It  is  often  useful,  when  many  past  observations  Xm, . . . ,  Xq,  X\ ,  ,Xn  (m  <  0) 
are  available,  to  evaluate  the  best  linear  predictor  of  Xn+h  in  terms  of  1,  Xm,  . . . ,  Xq, 
. . . ,  Xn.  This  predictor,  which  we  shall  denote  by  Pm,nXn+h,  can  easily  be  evaluated 
by  the  methods  described  above.  If  \m\  is  large,  this  predictor  can  be  approximated  by 
the  sometimes  more  easily  calculated  mean  square  limit 


PnA-n+h  — 


lim  PmnXl 

m—>  —  oo 


n+h  ■ 


We  shall  refer  to  Pn  as  the  prediction  operator  based  on  the  infinite  past,  { Xt , 
— oo  <  t  <  n}.  Analogously  we  shall  refer  to  Pn  as  the  prediction  operator  based 
on  the  finite  past,  {X\ , . . . ,  Xn}.  (Mean  square  convergence  of  random  variables  is 
discussed  in  Appendix  C.) 
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2.5.7  Determination  of  PnXn+h 


If $\{X_n\}$ is a zero-mean stationary process with autocovariance function $\gamma(\cdot)$ then, just as $P_n X_{n+h}$ is characterized by equation (2.5.10), $\tilde{P}_n X_{n+h}$ is characterized by the equations

$$E\left[\left(X_{n+h} - \tilde{P}_n X_{n+h}\right) X_{n+1-i}\right] = 0, \quad i = 1, 2, \ldots.$$

If we can find a solution to these equations, it will necessarily be the uniquely defined predictor $\tilde{P}_n X_{n+h}$. An approach to this problem that is often effective is to assume that $\tilde{P}_n X_{n+h}$ can be expressed in the form

$$\tilde{P}_n X_{n+h} = \sum_{j=1}^{\infty} \alpha_j X_{n+1-j},$$

in which case the preceding equations reduce to

$$E\left[\left(X_{n+h} - \sum_{j=1}^{\infty} \alpha_j X_{n+1-j}\right) X_{n+1-i}\right] = 0, \quad i = 1, 2, \ldots,$$

or equivalently,

$$\sum_{j=1}^{\infty} \gamma(i - j)\,\alpha_j = \gamma(h + i - 1), \quad i = 1, 2, \ldots.$$

This is an infinite set of linear equations for the unknown coefficients $\alpha_i$ that determine $\tilde{P}_n X_{n+h}$, provided that the resulting series converges.


Properties of $\tilde{P}_n$:

Suppose that $EU^2 < \infty$, $EV^2 < \infty$, $a$, $b$, and $c$ are constants, and $\Gamma = \mathrm{Cov}(\mathbf{W}, \mathbf{W})$.

1. $E[(U - \tilde{P}_n(U))X_j] = 0$, $j \le n$.

2. $\tilde{P}_n(aU + bV + c) = a\tilde{P}_n(U) + b\tilde{P}_n(V) + c$.

3. $\tilde{P}_n(U) = U$ if $U$ is a limit of linear combinations of $X_j$, $j \le n$.

4. $\tilde{P}_n(U) = EU$ if $\mathrm{Cov}(U, X_j) = 0$ for all $j \le n$.


These properties can sometimes be used to simplify the calculation of $\tilde{P}_n X_{n+h}$, notably when the process $\{X_t\}$ is an ARMA process.


Example 2.5.6 Consider the causal invertible ARMA(1,1) process $\{X_t\}$ defined by

$$X_t - \phi X_{t-1} = Z_t + \theta Z_{t-1}, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2).$$

We know from (2.3.3) and (2.3.5) that we have the representations

$$X_{n+1} = Z_{n+1} + (\phi + \theta) \sum_{j=1}^{\infty} \phi^{j-1} Z_{n+1-j}$$

and

$$Z_{n+1} = X_{n+1} - (\phi + \theta) \sum_{j=1}^{\infty} (-\theta)^{j-1} X_{n+1-j}.$$

Applying the operator $\tilde{P}_n$ to the second equation and using the properties of $\tilde{P}_n$ gives

$$\tilde{P}_n X_{n+1} = (\phi + \theta) \sum_{j=1}^{\infty} (-\theta)^{j-1} X_{n+1-j}.$$

Applying the operator $\tilde{P}_n$ to the first equation and using the properties of $\tilde{P}_n$ gives

$$\tilde{P}_n X_{n+1} = (\phi + \theta) \sum_{j=1}^{\infty} \phi^{j-1} Z_{n+1-j}.$$

Hence,

$$X_{n+1} - \tilde{P}_n X_{n+1} = Z_{n+1},$$

and so the mean squared error of the predictor $\tilde{P}_n X_{n+1}$ is $EZ_{n+1}^2 = \sigma^2$.

□ 


2.6  The  Wold  Decomposition 

Consider  the  stationary  process 

$$X_t = A \cos(\omega t) + B \sin(\omega t),$$

where $\omega \in (0, \pi)$ is constant and $A$, $B$ are uncorrelated random variables with mean 0 and variance $\sigma^2$. Notice that

$$X_n = (2 \cos \omega) X_{n-1} - X_{n-2} = \tilde{P}_{n-1} X_n, \quad n = 0, \pm 1, \ldots,$$

so that $X_n - \tilde{P}_{n-1} X_n = 0$ for all $n$. Processes with the latter property are said to be

deterministic. 
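The recursion $X_n = (2\cos\omega)X_{n-1} - X_{n-2}$ is easy to check numerically. The following is a minimal sketch, assuming independent standard normal draws for $A$ and $B$ (one convenient way to obtain uncorrelated coefficients); the function names and the value $\omega = 0.7$ are hypothetical choices for the illustration.

```python
import math
import random


def sinusoid_path(omega, n_points, seed=0):
    """One realization of X_t = A cos(omega t) + B sin(omega t), t = 0, ..., n_points - 1.
    A and B are drawn as independent N(0, 1) variables (an assumption for this sketch)."""
    rng = random.Random(seed)
    a, b = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return [a * math.cos(omega * t) + b * math.sin(omega * t) for t in range(n_points)]


def max_recursion_error(x, omega):
    """Largest |X_n - (2 cos(omega) X_{n-1} - X_{n-2})| along the path."""
    c = 2.0 * math.cos(omega)
    return max(abs(x[n] - (c * x[n - 1] - x[n - 2])) for n in range(2, len(x)))


path = sinusoid_path(0.7, 50)
print(max_recursion_error(path, 0.7))  # zero up to floating-point rounding
```

Every value after the first two is reproduced exactly from its two predecessors, which is what makes the process deterministic in the sense defined above.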


The Wold Decomposition:

If $\{X_t\}$ is a nondeterministic stationary time series, then

$$X_t = \sum_{j=0}^{\infty} \psi_j Z_{t-j} + V_t, \tag{2.6.1}$$

where

1. $\psi_0 = 1$ and $\sum_{j=0}^{\infty} \psi_j^2 < \infty$,

2. $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$,

3. $\mathrm{Cov}(Z_s, V_t) = 0$ for all $s$ and $t$,

4. $Z_t = \tilde{P}_t Z_t$ for all $t$,

5. $V_t = \tilde{P}_s V_t$ for all $s$ and $t$, and

6. $\{V_t\}$ is deterministic.

Here as in Section 2.5, $\tilde{P}_t Y$ denotes the best predictor of $Y$ in terms of linear combinations, or limits of linear combinations of $1, X_s$, $-\infty < s \le t$. The sequences $\{Z_t\}$, $\{\psi_j\}$, and $\{V_t\}$ are unique and can be written explicitly as $Z_t = X_t - \tilde{P}_{t-1} X_t$, $\psi_j = E(X_t Z_{t-j})/E(Z_t^2)$, and $V_t = X_t - \sum_{j=0}^{\infty} \psi_j Z_{t-j}$. (See Brockwell and Davis (1991), p. 188.) For most of the zero-mean stationary time series dealt with in this book (in particular for all ARMA processes) the deterministic component $V_t$ is 0 for all $t$, and the series is then said to be purely nondeterministic.

Example 2.6.1

If $X_t = U_t + Y$, where $\{U_t\} \sim \mathrm{WN}(0, \nu^2)$, $E(U_t Y) = 0$ for all $t$, and $Y$ has mean 0 and variance $\tau^2$, then $\tilde{P}_{t-1} X_t = Y$, since $Y$ is the mean square limit as $s \to \infty$ of $[X_{t-1} + \cdots + X_{t-s}]/s$, and $E[(X_t - Y)X_s] = 0$ for all $s \le t - 1$. Hence the sequences in the Wold decomposition of $\{X_t\}$ are given by $Z_t = U_t$, $\psi_0 = 1$, $\psi_j = 0$ for $j > 0$, and $V_t = Y$.

□


Problems

2.1 Suppose that $X_1, X_2, \ldots$ is a stationary time series with mean $\mu$ and ACF $\rho(\cdot)$. Show that the best predictor of $X_{n+h}$ of the form $aX_n + b$ is obtained by choosing $a = \rho(h)$ and $b = \mu(1 - \rho(h))$.

2.2 Show that the process

$$X_t = A\cos(\omega t) + B\sin(\omega t), \quad t = 0, \pm 1, \ldots$$

(where $A$ and $B$ are uncorrelated random variables with mean 0 and variance 1 and $\omega$ is a fixed frequency in the interval $[0, \pi]$), is stationary and find its mean and autocovariance function. Deduce that the function $\kappa(h) = \cos(\omega h)$, $h = 0, \pm 1, \ldots$, is nonnegative definite.

2.3 a. Find the ACVF of the time series $X_t = Z_t + 0.3Z_{t-1} - 0.4Z_{t-2}$, where $\{Z_t\} \sim \mathrm{WN}(0, 1)$.

b. Find the ACVF of the time series $Y_t = Z_t - 1.2Z_{t-1} - 1.6Z_{t-2}$, where $\{Z_t\} \sim \mathrm{WN}(0, 0.25)$. Compare with the answer found in (a).

2.4 It is clear that the function $\kappa(h) = 1$, $h = 0, \pm 1, \ldots$, is an autocovariance function, since it is the autocovariance function of the process $X_t = Z$, $t = 0, \pm 1, \ldots$, where $Z$ is a random variable with mean 0 and variance 1. By identifying appropriate sequences of random variables, show that the following functions are also autocovariance functions:


a. $\kappa(h) = (-1)^{|h|}$


K  (h)  =  1  +  cos 


+  cos 


c. 


if  h  =  0, 
ifh  =  ±1, 
otherwise. 


2.5 Suppose that $\{X_t, t = 0, \pm 1, \ldots\}$ is stationary and that $|\theta| < 1$. Show that for each fixed $n$ the sequence

$$S_m = \sum_{j=1}^{m} \theta^j X_{n-j}$$

is convergent absolutely and in mean square (see Appendix C) as $m \to \infty$.

2.6  Verify  the  equations  (2.2.6). 


2.7 Show, using the geometric series $1/(1 - x) = \sum_{j=0}^{\infty} x^j$ for $|x| < 1$, that $1/(1 - \phi z) = -\sum_{j=1}^{\infty} \phi^{-j} z^{-j}$ for $|\phi| > 1$ and $|z| \ge 1$.

2.8 Show that the autoregressive equations

$$X_t = \phi_1 X_{t-1} + Z_t, \quad t = 0, \pm 1, \ldots,$$

where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$ and $|\phi_1| = 1$, have no stationary solution. HINT: Suppose there does exist a stationary solution $\{X_t\}$ and use the autoregressive equation to derive an expression for the variance of $X_t - \phi_1^{n+1} X_{t-n-1}$ that contradicts the stationarity assumption.

2.9 Let $\{Y_t\}$ be the AR(1) plus noise time series defined by

$$Y_t = X_t + W_t,$$

where $\{W_t\} \sim \mathrm{WN}(0, \sigma_w^2)$, $\{X_t\}$ is the AR(1) process of Example 2.2.1, i.e.,

$$X_t - \phi X_{t-1} = Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma_z^2),$$

and $E(W_s Z_t) = 0$ for all $s$ and $t$.

a. Show that $\{Y_t\}$ is stationary and find its autocovariance function.

b. Show that the time series $U_t := Y_t - \phi Y_{t-1}$ is 1-correlated and hence, by Proposition 2.1.1, is an MA(1) process.

c. Conclude from (b) that $\{Y_t\}$ is an ARMA(1,1) process and express the three parameters of this model in terms of $\phi$, $\sigma_w^2$, and $\sigma_z^2$.

2.10 Use the program ITSM to compute the coefficients $\psi_j$ and $\pi_j$, $j = 1, \ldots, 5$, in the expansions

$$X_t = \sum_{j=0}^{\infty} \psi_j Z_{t-j}$$

and

$$Z_t = \sum_{j=0}^{\infty} \pi_j X_{t-j}$$

for the ARMA(1,1) process defined by the equations

$$X_t - 0.5X_{t-1} = Z_t + 0.5Z_{t-1}, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2).$$

(Select File>Project>New>Univariate, then Model>Specify. In the resulting dialog box enter 1 for the AR and MA orders, specify $\phi(1) = \theta(1) = 0.5$, and click OK. Finally, select Model>AR/MA Infinity>Default lag and the values of $\psi_j$ and $\pi_j$ will appear on the screen.) Check the results with those obtained in Section 2.3.

2.11 Suppose that in a sample of size 100 from an AR(1) process with mean $\mu$, $\phi = 0.6$, and $\sigma^2 = 2$ we obtain $\bar{x}_{100} = 0.271$. Construct an approximate 95% confidence interval for $\mu$. Are the data compatible with the hypothesis that $\mu = 0$?

2.12 Suppose that in a sample of size 100 from an MA(1) process with mean $\mu$, $\theta = -0.6$, and $\sigma^2 = 1$ we obtain $\bar{x}_{100} = 0.157$. Construct an approximate 95% confidence interval for $\mu$. Are the data compatible with the hypothesis that $\mu = 0$?

2.13 Suppose that in a sample of size 100, we obtain $\hat{\rho}(1) = 0.438$ and $\hat{\rho}(2) = 0.145$.

a. Assuming that the data were generated from an AR(1) model, construct approximate 95% confidence intervals for both $\rho(1)$ and $\rho(2)$. Based on these two confidence intervals, are the data consistent with an AR(1) model with $\phi = 0.8$?

b. Assuming that the data were generated from an MA(1) model, construct approximate 95% confidence intervals for both $\rho(1)$ and $\rho(2)$. Based on these two confidence intervals, are the data consistent with an MA(1) model with $\theta = 0.6$?

2.14 Let $\{X_t\}$ be the process defined in Problem 2.2.

a. Find $P_1 X_2$ and its mean squared error.

b. Find $P_2 X_3$ and its mean squared error.

c. Find $P_n X_{n+1}$ and its mean squared error.

2.15 Suppose that $\{X_t, t = 0, \pm 1, \ldots\}$ is a stationary process satisfying the equations

$$X_t = \phi_1 X_{t-1} + \cdots + \phi_p X_{t-p} + Z_t,$$

where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$ and $Z_t$ is uncorrelated with $X_s$ for each $s < t$. Show that the best linear predictor $P_n X_{n+1}$ of $X_{n+1}$ in terms of $1, X_1, \ldots, X_n$, assuming $n > p$, is

$$P_n X_{n+1} = \phi_1 X_n + \cdots + \phi_p X_{n+1-p}.$$

What is the mean squared error of $P_n X_{n+1}$?

2.16 Use the program ITSM to plot the sample ACF and PACF up to lag 40 of the sunspot series $D_t$, $t = 1, \ldots, 100$, contained in the ITSM file SUNSPOTS.TSM. (Open the project SUNSPOTS.TSM and click on the second yellow button at the top of the screen to see the graphs. Repeated clicking on this button will toggle between graphs of the sample ACF, sample PACF, and both. To see the numerical values, right-click on the graph and select Info.) Fit an AR(2) model to the mean-corrected data by selecting Model>Estimation>Preliminary and click Yes to subtract the sample mean from the data. In the dialog box that follows, enter 2 for the AR order and make sure that the MA order is zero and that the Yule-Walker algorithm is selected without AICC minimization. Click OK and you will obtain a model of the form

$$X_t = \phi_1 X_{t-1} + \phi_2 X_{t-2} + Z_t, \quad \text{where } \{Z_t\} \sim \mathrm{WN}(0, \sigma^2),$$

for the mean-corrected series $X_t = D_t - 46.93$. Record the values of the estimated parameters $\phi_1$, $\phi_2$, and $\sigma^2$. Compare the model and sample ACF and PACF by selecting the third yellow button at the top of the screen. Print the graphs by right-clicking and selecting Print.

2.17 Without exiting from ITSM, use the model found in the preceding problem to compute forecasts of the next ten values of the sunspot series. (Select Forecasting>ARMA, make sure that the number of forecasts is set to 10 and the box Add the mean to the forecasts is checked, and then click OK. You will see a graph of the original data with the ten forecasts appended. Right-click on the graph and then on Info to get the numerical values of the forecasts. Print the graph as described in Problem 2.16.) The details of the calculations will be taken up in Chapter 3 when we discuss ARMA models in detail.

2.18 Let $\{X_t\}$ be the stationary process defined by the equations

$$X_t = Z_t - \theta Z_{t-1}, \quad t = 0, \pm 1, \ldots,$$

where $|\theta| < 1$ and $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$. Show that the best linear predictor $\tilde{P}_n X_{n+1}$ of $X_{n+1}$ based on $\{X_j, -\infty < j \le n\}$ is

$$\tilde{P}_n X_{n+1} = -\sum_{j=1}^{\infty} \theta^j X_{n+1-j}.$$

What is the mean squared error of the predictor $\tilde{P}_n X_{n+1}$?

2.19 If $\{X_t\}$ is defined as in Problem 2.18 and $\theta = 1$, find the best linear predictor $P_n X_{n+1}$ of $X_{n+1}$ in terms of $X_1, \ldots, X_n$. What is the corresponding mean squared error?

2.20 In the innovations algorithm, show that for each $n \ge 2$, the innovation $X_n - \hat{X}_n$ is uncorrelated with $X_1, \ldots, X_{n-1}$. Conclude that $X_n - \hat{X}_n$ is uncorrelated with the innovations $X_1 - \hat{X}_1, \ldots, X_{n-1} - \hat{X}_{n-1}$.

2.21 Let $X_1, X_2, X_4, X_5$ be observations from the MA(1) model

$$X_t = Z_t + \theta Z_{t-1}, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2).$$

a. Find the best linear estimate of the missing value $X_3$ in terms of $X_1$ and $X_2$.

b. Find the best linear estimate of the missing value $X_3$ in terms of $X_4$ and $X_5$.

c. Find the best linear estimate of the missing value $X_3$ in terms of $X_1$, $X_2$, $X_4$, and $X_5$.

d. Compute the mean squared errors for each of the estimates in (a)-(c).

2.22 Repeat parts (a)-(d) of Problem 2.21 assuming now that the observations $X_1, X_2, X_4, X_5$ are from the causal AR(1) model

$$X_t = \phi X_{t-1} + Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2).$$


ARMA  Models 


3.1 ARMA(p, q) Processes

3.2  The  ACF  and  PACF  of  an  ARMA(p,  q)  Process 

3.3  Forecasting  ARMA  Processes 


In this chapter we introduce an important parametric family of stationary time series, the autoregressive moving-average, or ARMA, processes. For a large class of autocovariance functions $\gamma(\cdot)$ it is possible to find an ARMA process $\{X_t\}$ with ACVF $\gamma_X(\cdot)$ such that $\gamma(\cdot)$ is well approximated by $\gamma_X(\cdot)$. In particular, for any positive integer $K$, there exists an ARMA process $\{X_t\}$ such that $\gamma_X(h) = \gamma(h)$ for $h = 0, 1, \ldots, K$. For this (and other) reasons, the family of ARMA processes plays a key role in the modeling of time series data. The linear structure of ARMA processes also leads to a substantial simplification of the general methods for linear prediction discussed earlier in Section 2.5.


3.1  ARMA (p,  q)  Processes 

In Section 2.3 we introduced an ARMA(1,1) process and discussed some of its key properties. These included existence and uniqueness of stationary solutions of the defining equations and the concepts of causality and invertibility. In this section we extend these notions to the general ARMA(p, q) process.

Definition 3.1.1

$\{X_t\}$ is an ARMA(p, q) process if $\{X_t\}$ is stationary and if for every $t$,

$$X_t - \phi_1 X_{t-1} - \cdots - \phi_p X_{t-p} = Z_t + \theta_1 Z_{t-1} + \cdots + \theta_q Z_{t-q}, \tag{3.1.1}$$

where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$ and the polynomials $(1 - \phi_1 z - \cdots - \phi_p z^p)$ and $(1 + \theta_1 z + \cdots + \theta_q z^q)$ have no common factors.


The process $\{X_t\}$ is said to be an ARMA(p, q) process with mean $\mu$ if $\{X_t - \mu\}$ is an ARMA(p, q) process.

It is convenient to use the more concise form of (3.1.1)

$$\phi(B)X_t = \theta(B)Z_t, \tag{3.1.2}$$

where $\phi(\cdot)$ and $\theta(\cdot)$ are the $p$th and $q$th-degree polynomials

$$\phi(z) = 1 - \phi_1 z - \cdots - \phi_p z^p$$

and

$$\theta(z) = 1 + \theta_1 z + \cdots + \theta_q z^q,$$

and $B$ is the backward shift operator ($B^j X_t = X_{t-j}$, $B^j Z_t = Z_{t-j}$, $j = 0, \pm 1, \ldots$). The time series $\{X_t\}$ is said to be an autoregressive process of order $p$ (or AR(p)) if $\theta(z) \equiv 1$, and a moving-average process of order $q$ (or MA(q)) if $\phi(z) \equiv 1$.

An important part of Definition 3.1.1 is the requirement that $\{X_t\}$ be stationary. In Section 2.3 we showed, for the ARMA(1,1) equations (2.3.1), that a stationary solution exists (and is unique) if and only if $\phi_1 \ne \pm 1$. The latter is equivalent to the condition that the autoregressive polynomial $\phi(z) = 1 - \phi_1 z \ne 0$ for $z = \pm 1$. The analogous condition for the general ARMA(p, q) process is $\phi(z) = 1 - \phi_1 z - \cdots - \phi_p z^p \ne 0$ for all complex $z$ with $|z| = 1$. (Complex $z$ is used here, since the zeros of a polynomial of degree $p > 1$ may be either real or complex. The region defined by the set of complex $z$ such that $|z| = 1$ is referred to as the unit circle.) If $\phi(z) \ne 0$ for all $z$ on the unit circle, then there exists $\delta > 0$ such that

$$\frac{1}{\phi(z)} = \sum_{j=-\infty}^{\infty} \chi_j z^j \quad \text{for } 1 - \delta < |z| < 1 + \delta,$$

and $\sum_{j=-\infty}^{\infty} |\chi_j| < \infty$. We can then define $1/\phi(B)$ as the linear filter with absolutely summable coefficients

$$\frac{1}{\phi(B)} = \sum_{j=-\infty}^{\infty} \chi_j B^j.$$

Applying the operator $\chi(B) := 1/\phi(B)$ to both sides of (3.1.2), we obtain

$$X_t = \chi(B)\phi(B)X_t = \chi(B)\theta(B)Z_t = \psi(B)Z_t = \sum_{j=-\infty}^{\infty} \psi_j Z_{t-j}, \tag{3.1.3}$$

where $\psi(z) = \chi(z)\theta(z) = \sum_{j=-\infty}^{\infty} \psi_j z^j$. Using the argument given in Section 2.3 for the ARMA(1,1) process, it follows that $\psi(B)Z_t$ is the unique stationary solution of (3.1.1).


Existence and Uniqueness:

A stationary solution $\{X_t\}$ of equation (3.1.1) exists (and is also the unique stationary solution) if and only if

$$\phi(z) = 1 - \phi_1 z - \cdots - \phi_p z^p \ne 0 \quad \text{for all } |z| = 1. \tag{3.1.4}$$


In Section 2.3 we saw that the ARMA(1,1) process is causal, i.e., that $X_t$ can be expressed in terms of $Z_s$, $s \le t$, if and only if $|\phi_1| < 1$. For a general ARMA(p, q) process the analogous condition is that $\phi(z) \ne 0$ for $|z| \le 1$, i.e., the zeros of the autoregressive polynomial must all be greater than 1 in absolute value.


Causality:


An ARMA(p, q) process $\{X_t\}$ is causal, or a causal function of $\{Z_t\}$, if there exist constants $\{\psi_j\}$ such that $\sum_{j=0}^{\infty} |\psi_j| < \infty$ and

$$X_t = \sum_{j=0}^{\infty} \psi_j Z_{t-j} \quad \text{for all } t. \tag{3.1.5}$$

Causality is equivalent to the condition

$$\phi(z) = 1 - \phi_1 z - \cdots - \phi_p z^p \ne 0 \quad \text{for all } |z| \le 1. \tag{3.1.6}$$


The proof of the equivalence between causality and (3.1.6) follows from elementary properties of power series. From (3.1.3) we see that $\{X_t\}$ is causal if and only if $\chi(z) := 1/\phi(z) = \sum_{j=0}^{\infty} \chi_j z^j$ (assuming that $\phi(z)$ and $\theta(z)$ have no common factors). But this, in turn, is equivalent to (3.1.6).

The sequence $\{\psi_j\}$ in (3.1.5) is determined by the relation $\psi(z) = \sum_{j=0}^{\infty} \psi_j z^j = \theta(z)/\phi(z)$, or equivalently by the identity

$$(1 - \phi_1 z - \cdots - \phi_p z^p)(\psi_0 + \psi_1 z + \cdots) = 1 + \theta_1 z + \cdots + \theta_q z^q.$$

Equating coefficients of $z^j$, $j = 0, 1, \ldots$, we find that

$$\begin{aligned}
1 &= \psi_0, \\
\theta_1 &= \psi_1 - \psi_0 \phi_1, \\
\theta_2 &= \psi_2 - \psi_1 \phi_1 - \psi_0 \phi_2, \\
&\;\;\vdots
\end{aligned}$$

or equivalently,

$$\psi_j - \sum_{k=1}^{p} \phi_k \psi_{j-k} = \theta_j, \quad j = 0, 1, \ldots, \tag{3.1.7}$$

where $\theta_0 := 1$, $\theta_j := 0$ for $j > q$, and $\psi_j := 0$ for $j < 0$.

Invertibility, which allows $Z_t$ to be expressed in terms of $X_s$, $s \le t$, has a similar characterization in terms of the moving-average polynomial.

Invertibility:

An ARMA(p, q) process $\{X_t\}$ is invertible if there exist constants $\{\pi_j\}$ such that $\sum_{j=0}^{\infty} |\pi_j| < \infty$ and

$$Z_t = \sum_{j=0}^{\infty} \pi_j X_{t-j} \quad \text{for all } t.$$

Invertibility is equivalent to the condition

$$\theta(z) = 1 + \theta_1 z + \cdots + \theta_q z^q \ne 0 \quad \text{for all } |z| \le 1.$$


Interchanging the roles of the AR and MA polynomials, we find from (3.1.7) that the sequence $\{\pi_j\}$ is determined by the equations

$$\pi_j + \sum_{k=1}^{q} \theta_k \pi_{j-k} = -\phi_j, \quad j = 0, 1, \ldots, \tag{3.1.8}$$

where $\phi_0 := -1$, $\phi_j := 0$ for $j > p$, and $\pi_j := 0$ for $j < 0$.

Example 3.1.1 An ARMA(1,1) Process

Consider the ARMA(1,1) process $\{X_t\}$ satisfying the equations

$$X_t - 0.5X_{t-1} = Z_t + 0.4Z_{t-1}, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2). \tag{3.1.9}$$

Since the autoregressive polynomial $\phi(z) = 1 - 0.5z$ has a zero at $z = 2$, which is located outside the unit circle, we conclude from (3.1.4) and (3.1.6) that there exists a unique ARMA process satisfying (3.1.9) that is also causal. The coefficients $\{\psi_j\}$ in the MA($\infty$) representation of $\{X_t\}$ are found directly from (3.1.7):

$$\begin{aligned}
\psi_0 &= 1, \\
\psi_1 &= 0.4 + 0.5, \\
\psi_2 &= 0.5(0.4 + 0.5), \\
\psi_j &= 0.5^{j-1}(0.4 + 0.5), \quad j = 1, 2, \ldots.
\end{aligned}$$

The MA polynomial $\theta(z) = 1 + 0.4z$ has a zero at $z = -1/0.4 = -2.5$, which is also located outside the unit circle. This implies that $\{X_t\}$ is invertible with coefficients $\{\pi_j\}$ given by [see (3.1.8)]

$$\begin{aligned}
\pi_0 &= 1, \\
\pi_1 &= -(0.4 + 0.5), \\
\pi_2 &= -(0.4 + 0.5)(-0.4), \\
\pi_j &= -(0.4 + 0.5)(-0.4)^{j-1}, \quad j = 1, 2, \ldots.
\end{aligned}$$

(A direct derivation of these formulas for $\{\psi_j\}$ and $\{\pi_j\}$ was given in Section 2.3 without appealing to the recursions (3.1.7) and (3.1.8).)

□


Example 3.1.2 An AR(2) Process

Let $\{X_t\}$ be the AR(2) process

$$X_t = 0.7X_{t-1} - 0.1X_{t-2} + Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2).$$

The autoregressive polynomial for this process has the factorization $\phi(z) = 1 - 0.7z + 0.1z^2 = (1 - 0.5z)(1 - 0.2z)$, and is therefore zero at $z = 2$ and $z = 5$. Since these zeros lie outside the unit circle, we conclude that $\{X_t\}$ is a causal AR(2) process with coefficients $\{\psi_j\}$ given by

$$\begin{aligned}
\psi_0 &= 1, \\
\psi_1 &= 0.7, \\
\psi_2 &= 0.7^2 - 0.1, \\
\psi_j &= 0.7\psi_{j-1} - 0.1\psi_{j-2}, \quad j = 2, 3, \ldots.
\end{aligned}$$

While it is a simple matter to calculate $\psi_j$ numerically for any $j$, it is possible also to give an explicit solution of these difference equations using the theory of linear difference equations (see Brockwell and Davis (1991), Section 3.6).

□ 
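The recursions (3.1.7) and (3.1.8) used in Examples 3.1.1 and 3.1.2 translate directly into code. The sketch below is illustrative only: the function names `psi_weights` and `pi_weights` are hypothetical, the coefficient lists hold $\phi_1, \ldots, \phi_p$ and $\theta_1, \ldots, \theta_q$, and causality or invertibility of the model is assumed rather than checked.

```python
def psi_weights(phi, theta, n):
    """psi_0, ..., psi_n from (3.1.7): psi_j = theta_j + sum_{k=1}^p phi_k psi_{j-k},
    with theta_0 = 1 and theta_j = 0 for j > q."""
    psi = []
    for j in range(n + 1):
        th = 1.0 if j == 0 else (theta[j - 1] if j <= len(theta) else 0.0)
        psi.append(th + sum(phi[k - 1] * psi[j - k] for k in range(1, min(j, len(phi)) + 1)))
    return psi


def pi_weights(phi, theta, n):
    """pi_0, ..., pi_n from (3.1.8): pi_j = -phi_j - sum_{k=1}^q theta_k pi_{j-k},
    with phi_0 = -1 and phi_j = 0 for j > p."""
    pi = []
    for j in range(n + 1):
        ph = -1.0 if j == 0 else (phi[j - 1] if j <= len(phi) else 0.0)
        pi.append(-ph - sum(theta[k - 1] * pi[j - k] for k in range(1, min(j, len(theta)) + 1)))
    return pi


# Example 3.1.1: X_t - 0.5 X_{t-1} = Z_t + 0.4 Z_{t-1}
print(psi_weights([0.5], [0.4], 3))
print(pi_weights([0.5], [0.4], 3))

# Example 3.1.2: X_t = 0.7 X_{t-1} - 0.1 X_{t-2} + Z_t
print(psi_weights([0.7, -0.1], [], 3))
```

The printed values can be compared with the closed forms $\psi_j = 0.5^{j-1}(0.9)$ and $\pi_j = -0.9(-0.4)^{j-1}$ of Example 3.1.1 and with $\psi_2 = 0.7^2 - 0.1$ of Example 3.1.2.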

The option Model>Specify of the program ITSM allows the entry of any causal ARMA(p, q) model with $p < 28$ and $q < 28$. This option contains a causality check and will immediately let you know if the entered model is noncausal. (A causal model can be obtained by setting all the AR coefficients equal to 0.001.) Once a causal model has been entered, the coefficients $\psi_j$ in the MA($\infty$) representation of the process can be computed by selecting Model>AR/MA Infinity. This option will also compute the AR($\infty$) coefficients $\pi_j$, provided that the model is invertible.

Example 3.1.3 An ARMA(2,1) Process

Consider the ARMA(2,1) process defined by the equations

$$X_t - 0.75X_{t-1} + 0.5625X_{t-2} = Z_t + 1.25Z_{t-1}, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2).$$

The AR polynomial $\phi(z) = 1 - 0.75z + 0.5625z^2$ has zeros at $z = 2(1 \pm i\sqrt{3})/3$, which lie outside the unit circle. The process is therefore causal. On the other hand, the MA polynomial $\theta(z) = 1 + 1.25z$ has a zero at $z = -0.8$, and hence $\{X_t\}$ is not invertible.

□ 
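The zero locations in Example 3.1.3 can be confirmed numerically. The following minimal sketch uses only the quadratic formula; the helper `quadratic_zeros` is a hypothetical name introduced for the illustration, and a general-purpose polynomial root finder would be needed for $p > 2$.

```python
import cmath


def quadratic_zeros(c0, c1, c2):
    """Zeros of c0 + c1*z + c2*z**2 (c2 must be nonzero), possibly complex."""
    d = cmath.sqrt(c1 * c1 - 4.0 * c2 * c0)
    return ((-c1 + d) / (2.0 * c2), (-c1 - d) / (2.0 * c2))


# AR polynomial phi(z) = 1 - 0.75 z + 0.5625 z^2
ar_zeros = quadratic_zeros(1.0, -0.75, 0.5625)
# MA polynomial theta(z) = 1 + 1.25 z has the single zero z = -1/1.25
ma_zero = -1.0 / 1.25

causal = all(abs(z) > 1.0 for z in ar_zeros)   # all AR zeros outside the unit circle
invertible = abs(ma_zero) > 1.0                # MA zero outside the unit circle

print(ar_zeros, [abs(z) for z in ar_zeros])
print("causal:", causal, "invertible:", invertible)
```

Both AR zeros have modulus $4/3$, while the MA zero has modulus 0.8, in agreement with the conclusion of the example.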

Remark 1. It should be noted that causality and invertibility are properties not of $\{X_t\}$ alone, but rather of the relationship between the two processes $\{X_t\}$ and $\{Z_t\}$ appearing in the defining ARMA equations (3.1.1). □

Remark 2. If $\{X_t\}$ is an ARMA process defined by $\phi(B)X_t = \theta(B)Z_t$, where $\theta(z) \ne 0$ if $|z| = 1$, then it is always possible (see Brockwell and Davis (1991), p. 127) to find polynomials $\tilde{\phi}(z)$ and $\tilde{\theta}(z)$ and a white noise sequence $\{W_t\}$ such that $\tilde{\phi}(B)X_t = \tilde{\theta}(B)W_t$ and $\tilde{\theta}(z)$ and $\tilde{\phi}(z)$ are nonzero for $|z| \le 1$. However, if the original white noise sequence $\{Z_t\}$ is iid, then the new white noise sequence will not be iid unless $\{Z_t\}$ is Gaussian.

□ 

In  view  of  Remark  2,  we  will  focus  our  attention  principally  on  causal  and 
invertible  ARMA  processes. 


3.2  The  ACF  and  PACF  of  an  ARMA(p,  q)  Process 

In this section we discuss three methods for computing the autocovariance function $\gamma(\cdot)$ of a causal ARMA process $\{X_t\}$. The autocorrelation function is readily found from the ACVF on dividing by $\gamma(0)$. The partial autocorrelation function (PACF) is also found from the function $\gamma(\cdot)$.


3.2.1  Calculation  of  the  ACVF 


First we determine the ACVF $\gamma(\cdot)$ of the causal ARMA(p, q) process defined by

$$\phi(B)X_t = \theta(B)Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2), \tag{3.2.1}$$

where $\phi(z) = 1 - \phi_1 z - \cdots - \phi_p z^p$ and $\theta(z) = 1 + \theta_1 z + \cdots + \theta_q z^q$. The causality assumption implies that

$$X_t = \sum_{j=0}^{\infty} \psi_j Z_{t-j}, \tag{3.2.2}$$

where $\sum_{j=0}^{\infty} \psi_j z^j = \theta(z)/\phi(z)$, $|z| \le 1$. The calculation of the sequence $\{\psi_j\}$ was discussed in Section 3.1.


First Method. From Proposition 2.2.1 and the representation (3.2.2), we obtain

$$\gamma(h) = E(X_{t+h}X_t) = \sigma^2 \sum_{j=0}^{\infty} \psi_j \psi_{j+|h|}. \tag{3.2.3}$$


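Example 3.2.1 The ARMA(1,1) Process

Substituting from (2.3.3) into (3.2.3), we find that the ACVF of the process defined by

$$X_t - \phi X_{t-1} = Z_t + \theta Z_{t-1}, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2), \tag{3.2.4}$$

with $|\phi| < 1$ is given by

$$\gamma(0) = \sigma^2 \sum_{j=0}^{\infty} \psi_j^2 = \sigma^2\left[1 + (\theta + \phi)^2 \sum_{j=0}^{\infty} \phi^{2j}\right] = \sigma^2\left[1 + \frac{(\theta + \phi)^2}{1 - \phi^2}\right],$$

$$\gamma(1) = \sigma^2 \sum_{j=0}^{\infty} \psi_{j+1}\psi_j = \sigma^2\left[\theta + \phi + (\theta + \phi)^2 \phi \sum_{j=0}^{\infty} \phi^{2j}\right] = \sigma^2\left[\theta + \phi + \frac{(\theta + \phi)^2 \phi}{1 - \phi^2}\right],$$

and

$$\gamma(h) = \phi^{h-1}\gamma(1), \quad h \ge 2.$$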


□ 
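As a rough numerical check of the first method for the ARMA(1,1) process, the series (3.2.3) can be truncated and compared with the closed-form expressions above. This is a minimal sketch: the function names, the truncation point `terms=200`, and the parameter values $\phi = 0.5$, $\theta = 0.4$ are assumptions made for the illustration.

```python
def arma11_acvf_series(phi, theta, sigma2, h, terms=200):
    """gamma(h) from (3.2.3), truncating the infinite sum after `terms` + 1 terms.
    Uses psi_0 = 1 and psi_j = (phi + theta) * phi**(j - 1) for j >= 1."""
    psi = [1.0] + [(phi + theta) * phi ** (j - 1) for j in range(1, terms + h + 1)]
    return sigma2 * sum(psi[j] * psi[j + h] for j in range(terms + 1))


def arma11_acvf_closed(phi, theta, sigma2, h):
    """Closed-form gamma(h) for the causal ARMA(1,1) process, h >= 0."""
    g0 = sigma2 * (1.0 + (theta + phi) ** 2 / (1.0 - phi ** 2))
    g1 = sigma2 * (theta + phi + (theta + phi) ** 2 * phi / (1.0 - phi ** 2))
    return g0 if h == 0 else phi ** (h - 1) * g1


for h in range(4):
    print(h, arma11_acvf_series(0.5, 0.4, 1.0, h), arma11_acvf_closed(0.5, 0.4, 1.0, h))
```

With $|\phi|$ well below 1 the truncated sum and the closed form agree to many decimal places; for $|\phi|$ close to 1 more terms would be needed.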


Example 3.2.2 The MA(q) Process

For the process

$$X_t = Z_t + \theta_1 Z_{t-1} + \cdots + \theta_q Z_{t-q}, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2),$$

Equation (3.2.3) immediately gives the result

$$\gamma(h) = \begin{cases} \sigma^2 \displaystyle\sum_{j=0}^{q-|h|} \theta_j \theta_{j+|h|}, & \text{if } |h| \le q, \\[2ex] 0, & \text{if } |h| > q, \end{cases}$$

where $\theta_0$ is defined to be 1. The ACVF of the MA(q) process thus has the distinctive feature of vanishing at lags greater than $q$. Data for which the sample ACVF is small for lags greater than $q$ therefore suggest that an appropriate model might be a moving average of order $q$ (or less). Recall from Proposition 2.1.1 that every zero-mean stationary process with correlations vanishing at lags greater than $q$ can be represented as a moving-average process of order $q$ or less.

□ 
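The MA(q) autocovariance formula is a finite sum and can be coded directly. The sketch below is only an illustration; the function name `ma_acvf` and the MA(2) coefficients in the example are hypothetical.

```python
def ma_acvf(theta, sigma2, h):
    """gamma(h) for X_t = Z_t + theta_1 Z_{t-1} + ... + theta_q Z_{t-q},
    where theta = [theta_1, ..., theta_q] and theta_0 = 1."""
    coeffs = [1.0] + list(theta)
    q = len(theta)
    h = abs(h)
    if h > q:
        return 0.0  # the ACVF vanishes beyond lag q
    return sigma2 * sum(coeffs[j] * coeffs[j + h] for j in range(q - h + 1))


# Hypothetical MA(2) example with theta_1 = 0.5, theta_2 = 0.25, sigma^2 = 1
print([ma_acvf([0.5, 0.25], 1.0, h) for h in range(-3, 4)])
```

The output is symmetric in $h$ and is exactly zero at lags $\pm 3$, illustrating the cutoff property described above.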

Second Method. If we multiply each side of the equations

$$X_t - \phi_1 X_{t-1} - \cdots - \phi_p X_{t-p} = Z_t + \theta_1 Z_{t-1} + \cdots + \theta_q Z_{t-q},$$

by $X_{t-k}$, $k = 0, 1, 2, \ldots$, and take expectations on each side, we find that

$$\gamma(k) - \phi_1 \gamma(k-1) - \cdots - \phi_p \gamma(k-p) = \sigma^2 \sum_{j=0}^{\infty} \theta_{k+j}\psi_j, \quad 0 \le k < m, \tag{3.2.5}$$

and

$$\gamma(k) - \phi_1 \gamma(k-1) - \cdots - \phi_p \gamma(k-p) = 0, \quad k \ge m, \tag{3.2.6}$$

where $m = \max(p, q+1)$, $\psi_j := 0$ for $j < 0$, $\theta_0 := 1$, and $\theta_j := 0$ for $j \notin \{0, \ldots, q\}$. In calculating the right-hand side of (3.2.5) we have made use of the expansion (3.2.2). Equations (3.2.6) are a set of homogeneous linear difference equations with constant coefficients, for which the solution is well known (see, e.g., Brockwell and Davis (1991), Section 3.6) to be of the form

$$\gamma(h) = \alpha_1 \xi_1^{-h} + \alpha_2 \xi_2^{-h} + \cdots + \alpha_p \xi_p^{-h}, \quad h \ge m - p, \tag{3.2.7}$$

where $\xi_1, \ldots, \xi_p$ are the roots (assumed to be distinct) of the equation $\phi(z) = 0$, and $\alpha_1, \ldots, \alpha_p$ are arbitrary constants. (For further details, and for the treatment of the case where the roots are not distinct, see Brockwell and Davis (1991), Section 3.6.) Of course, we are looking for the solution of (3.2.6) that also satisfies (3.2.5). We therefore substitute the solution (3.2.7) into (3.2.5) to obtain a set of $m$ linear equations that then uniquely determine the constants $\alpha_1, \ldots, \alpha_p$ and the $m - p$ autocovariances $\gamma(h)$, $0 \le h < m - p$.

Example 3.2.3 The ARMA(1,1) Process

For the causal ARMA(1,1) process defined in Example 3.2.1, equations (3.2.5) are

$$\gamma(0) - \phi\gamma(1) = \sigma^2(1 + \theta(\theta + \phi)) \tag{3.2.8}$$

and

$$\gamma(1) - \phi\gamma(0) = \sigma^2\theta. \tag{3.2.9}$$

Equation (3.2.6) takes the form

$$\gamma(k) - \phi\gamma(k-1) = 0, \quad k \ge 2. \tag{3.2.10}$$

The solution of (3.2.10) is

$$\gamma(h) = \alpha\phi^h, \quad h \ge 1.$$

Substituting this expression for $\gamma(h)$ into the two preceding equations (3.2.8) and (3.2.9) gives two linear equations for $\alpha$ and the unknown autocovariance $\gamma(0)$. These equations are easily solved, giving the autocovariances already found for this process in Example 3.2.1.

□ 


Example 3.2.4 The General AR(2) Process

For the causal AR(2) process defined by

$$(1 - \xi_1^{-1}B)(1 - \xi_2^{-1}B)X_t = Z_t, \quad |\xi_1|, |\xi_2| > 1, \ \xi_1 \ne \xi_2,$$

we easily find from (3.2.7) and (3.2.5) using the relations

$$\phi_1 = \xi_1^{-1} + \xi_2^{-1}$$

and

$$\phi_2 = -\xi_1^{-1}\xi_2^{-1}$$

that

$$\gamma(h) = \frac{\sigma^2 \xi_1^2 \xi_2^2}{(\xi_1\xi_2 - 1)(\xi_2 - \xi_1)}\left[(\xi_1^2 - 1)^{-1}\xi_1^{1-h} - (\xi_2^2 - 1)^{-1}\xi_2^{1-h}\right]. \tag{3.2.11}$$

Figures 3-1, 3-2, 3-3, and 3-4 illustrate some of the possible forms of $\gamma(\cdot)$ for different values of $\xi_1$ and $\xi_2$. Notice that in the case of complex conjugate roots $\xi_1 = re^{i\theta}$ and $\xi_2 = re^{-i\theta}$, $0 < \theta < \pi$, we can write (3.2.11) in the more illuminating form

$$\gamma(h) = \frac{\sigma^2 r^4 \cdot r^{-h} \sin(h\theta + \psi)}{(r^2 - 1)(r^4 - 2r^2\cos 2\theta + 1)^{1/2}\sin\theta}, \tag{3.2.12}$$

where

$$\tan\psi = \frac{r^2 + 1}{r^2 - 1}\tan\theta \tag{3.2.13}$$

and $\cos\psi$ has the same sign as $\cos\theta$. Thus in this case $\gamma(\cdot)$ has the form of a damped sinusoidal function with damping factor $r^{-1}$ and period $2\pi/\theta$. If the roots are close to the unit circle, then $r$ is close to 1, the damping is slow, and we obtain a nearly sinusoidal autocovariance function.

□

Figure 3-1. The model ACF of the AR(2) series of Example 3.2.4 with $\xi_1 = 2$ and $\xi_2 = 5$ (horizontal axis: lag).

Figure 3-2. The model ACF of the AR(2) series of Example 3.2.4 with $\xi_1 = 10/9$ and $\xi_2 = 2$.

Figure 3-3. The model ACF of the AR(2) series of Example 3.2.4 with $\xi_1 = -10/9$ and $\xi_2 = 2$.

Third Method. The autocovariances can also be found by solving the first $p + 1$ equations of (3.2.5) and (3.2.6) for $\gamma(0), \ldots, \gamma(p)$ and then using the subsequent equations to solve successively for $\gamma(p+1), \gamma(p+2), \ldots$. This is an especially convenient method for numerical determination of the autocovariances $\gamma(h)$ and is used in the option Model>ACF/PACF>Model of the program ITSM.

Figure 3-4. The model ACF of the AR(2) series of Example 3.2.4 with $\xi_1 = 2(1 + i\sqrt{3})/3$ and $\xi_2 = 2(1 - i\sqrt{3})/3$.

Example 3.2.5

Consider again the causal ARMA(1,1) process of Example 3.2.1. To apply the third method we simply solve (3.2.8) and (3.2.9) for $\gamma(0)$ and $\gamma(1)$. Then $\gamma(2), \gamma(3), \ldots$ can be found successively from (3.2.10). It is easy to check that this procedure gives the same results as those obtained in Examples 3.2.1 and 3.2.3.

□ 
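The third method lends itself to a short program. The following is a sketch under stated assumptions, not a description of how ITSM implements the option: the function names `solve` and `arma_acvf` are hypothetical, the model is assumed causal, and a small Gaussian elimination routine stands in for a library linear solver.

```python
def solve(a, b):
    """Solve the dense linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def arma_acvf(phi, theta, sigma2, max_lag):
    """gamma(0), ..., gamma(max_lag) of a causal ARMA(p, q) process (third method)."""
    p, q = len(phi), len(theta)
    th = [1.0] + list(theta)
    psi = []  # psi_0, ..., psi_q from the recursion (3.1.7)
    for j in range(q + 1):
        psi.append(th[j] + sum(phi[k - 1] * psi[j - k] for k in range(1, min(j, p) + 1)))

    def rhs(k):
        # Right-hand side of (3.2.5); it is zero once k > q, as in (3.2.6).
        return sigma2 * sum(th[k + j] * psi[j] for j in range(q - k + 1)) if k <= q else 0.0

    # First p + 1 equations: gamma(k) - sum_i phi_i gamma(|k - i|) = rhs(k), k = 0, ..., p.
    a = [[0.0] * (p + 1) for _ in range(p + 1)]
    for k in range(p + 1):
        a[k][k] += 1.0
        for i in range(1, p + 1):
            a[k][abs(k - i)] -= phi[i - 1]
    gamma = solve(a, [rhs(k) for k in range(p + 1)])
    # Subsequent equations give gamma(p + 1), gamma(p + 2), ... successively.
    for k in range(p + 1, max_lag + 1):
        gamma.append(rhs(k) + sum(phi[i - 1] * gamma[k - i] for i in range(1, p + 1)))
    return gamma[: max_lag + 1]


print(arma_acvf([0.5], [0.4], 1.0, 3))       # ARMA(1,1) with phi = 0.5, theta = 0.4
print(arma_acvf([0.7, -0.1], [], 1.0, 3))    # AR(2) with zeros at 2 and 5
```

For the ARMA(1,1) case the first two values reproduce the solution of (3.2.8) and (3.2.9), and for the AR(2) case $\gamma(0)$ can be compared with (3.2.11) evaluated at $h = 0$.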


3.2.2  The  Autocorrelation  Function 


Recall  that  the  ACF  of  an  ARMA  process  {Xr}  is  the  function  p(-)  found  immediately 
from  the  ACYF  y  (•)  as 


Y(h) 

y  (0) ' 


Likewise,  for  any  set  of  observations  [x\ , . . . ,  xn},  the  sample  ACF  p(-)  is  computed  as 


The Sample ACF of an MA(q) Series. Given observations $\{x_1, \ldots, x_n\}$ of a time
series, one approach to the fitting of a model to the data is to match the sample ACF
of the data with the ACF of the model. In particular, if the sample ACF $\hat\rho(h)$ is
significantly different from zero for $0 \le h \le q$ and negligible for $h > q$, Example
3.2.2 suggests that an MA(q) model might provide a good representation of the data.
In order to apply this criterion we need to take into account the random variation
expected in the sample autocorrelation function before we can classify ACF values
as "negligible." To resolve this problem we can use Bartlett's formula (Section 2.4),
which implies that for a large sample of size $n$ from an MA(q) process, the sample
ACF values at lags $h$ greater than $q$ are approximately normally distributed with
means 0 and variances $w_{hh}/n = (1 + 2\rho^2(1) + \cdots + 2\rho^2(q))/n$. This means
that if the sample is from an MA(q) process and if $h > q$, then $\hat\rho(h)$ should fall
between the bounds $\pm 1.96\sqrt{w_{hh}/n}$ with probability approximately 0.95. In practice
we frequently use the more stringent values $\pm 1.96/\sqrt{n}$ as the bounds between which
sample autocorrelations are considered "negligible." A more effective and systematic
approach to the problem of model selection, which also applies to ARMA(p, q) models
with $p > 0$ and $q > 0$, will be discussed in Section 5.5.
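As a rough illustration of the criterion just described, the following sketch computes a sample ACF and the Bartlett bound $1.96\sqrt{w_{hh}/n}$. It assumes the usual sample ACVF with divisor $n$, and it plugs sample autocorrelations in for the unknown $\rho(k)$; the function names are ours and the data are made up.

```python
import math


def sample_acf(x, max_lag):
    """Sample autocorrelations rho_hat(0..max_lag), using divisor n in the ACVF."""
    n = len(x)
    mean = sum(x) / n
    d = [v - mean for v in x]
    gamma = [sum(d[t] * d[t + h] for t in range(n - h)) / n
             for h in range(max_lag + 1)]
    return [g / gamma[0] for g in gamma]


def ma_q_bound(rho, q, n):
    """1.96 * sqrt(w_hh / n) with w_hh = 1 + 2 * sum_{k=1}^{q} rho(k)^2, for lags h > q."""
    w = 1.0 + 2.0 * sum(rho[k] ** 2 for k in range(1, q + 1))
    return 1.96 * math.sqrt(w / n)


# Illustrative data only
x = [1.0, 2.0, 3.0, 4.0]
rho_hat = sample_acf(x, 1)
bound_q0 = ma_q_bound(rho_hat, 0, len(x))   # reduces to 1.96 / sqrt(n)
bound_q1 = ma_q_bound(rho_hat, 1, len(x))   # wider bound under an MA(1) hypothesis
```

With `q = 0` the bound reduces to the more stringent value $1.96/\sqrt{n}$ mentioned in the text.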


3.2.3 The Partial Autocorrelation Function

The partial autocorrelation function (PACF) of an ARMA process $\{X_t\}$ is the
function $\alpha(\cdot)$ defined by the equations

$$\alpha(0) = 1$$

and

$$\alpha(h) = \phi_{hh}, \quad h \ge 1,$$

where $\phi_{hh}$ is the last component of

$$\boldsymbol{\phi}_h = \Gamma_h^{-1}\boldsymbol{\gamma}_h, \tag{3.2.14}$$

$\Gamma_h = [\gamma(i - j)]_{i,j=1}^{h}$, and $\boldsymbol{\gamma}_h = [\gamma(1), \gamma(2), \ldots, \gamma(h)]'$.

For any set of observations $\{x_1, \ldots, x_n\}$ with $x_i \ne x_j$ for some $i$ and $j$, the sample
PACF $\hat\alpha(h)$ is given by

$$\hat\alpha(0) = 1$$

and

$$\hat\alpha(h) = \hat\phi_{hh}, \quad h \ge 1,$$

where $\hat\phi_{hh}$ is the last component of

$$\hat{\boldsymbol{\phi}}_h = \hat\Gamma_h^{-1}\hat{\boldsymbol{\gamma}}_h. \tag{3.2.15}$$


We show in the next example that the PACF of a causal AR(p) process is zero for
lags greater than $p$. Both sample and model partial autocorrelation functions can be
computed numerically using the program ITSM. Algebraic calculation of the PACF is
quite complicated except when $q$ is zero or $p$ and $q$ are both small.

It can be shown (Brockwell and Davis (1991), p. 171) that $\phi_{hh}$ is the correlation
between the prediction errors $X_h - P(X_h \mid X_1, \ldots, X_{h-1})$ and $X_0 - P(X_0 \mid X_1, \ldots, X_{h-1})$.


Example 3.2.6 The PACF of an AR(p) Process

For the causal AR(p) process defined by

$$X_t - \phi_1 X_{t-1} - \cdots - \phi_p X_{t-p} = Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2),$$

we know (Problem 2.15) that for $h \ge p$ the best linear predictor of $X_{h+1}$ in terms of
$1, X_1, \ldots, X_h$ is

$$\hat{X}_{h+1} = \phi_1 X_h + \phi_2 X_{h-1} + \cdots + \phi_p X_{h+1-p}.$$

Since the coefficient $\phi_{hh}$ of $X_1$ is $\phi_p$ if $h = p$ and 0 if $h > p$, we conclude that the
PACF $\alpha(\cdot)$ of the process $\{X_t\}$ has the properties

$$\alpha(p) = \phi_p$$

and

$$\alpha(h) = 0 \quad \text{for } h > p.$$

For $h < p$ the values of $\alpha(h)$ can easily be computed from (3.2.14). For any
specified ARMA model the PACF can be evaluated numerically using the option
Model>ACF/PACF>Model of the program ITSM.

□ 
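The cut-off property of Example 3.2.6 can be checked numerically. The sketch below does not invert $\Gamma_h$ as in (3.2.14); instead it uses the Durbin-Levinson recursion, which produces the same last components $\phi_{hh}$. The function name and the AR(2) coefficients are illustrative assumptions, and the AR(2) autocorrelations are generated from the standard relations $\rho(1) = \phi_1/(1 - \phi_2)$ and $\rho(h) = \phi_1\rho(h-1) + \phi_2\rho(h-2)$.

```python
def pacf_from_acf(rho, max_lag):
    """alpha(0..max_lag) from autocorrelations rho[0..max_lag] (rho[0] = 1),
    computed with the Durbin-Levinson recursion."""
    alpha = [1.0]
    phi_prev = []          # phi_{h-1,1}, ..., phi_{h-1,h-1}
    v = 1.0                # prediction error variance divided by gamma(0)
    for h in range(1, max_lag + 1):
        num = rho[h] - sum(phi_prev[j] * rho[h - 1 - j] for j in range(h - 1))
        phi_hh = num / v
        phi_cur = [phi_prev[j] - phi_hh * phi_prev[h - 2 - j] for j in range(h - 1)]
        phi_cur.append(phi_hh)
        v *= 1.0 - phi_hh ** 2
        alpha.append(phi_hh)
        phi_prev = phi_cur
    return alpha


# Causal AR(2) with illustrative coefficients phi_1 = 0.5, phi_2 = 0.3
phi1, phi2 = 0.5, 0.3
rho = [1.0, phi1 / (1.0 - phi2)]
for h in range(2, 6):
    rho.append(phi1 * rho[-1] + phi2 * rho[-2])
alpha = pacf_from_acf(rho, 5)   # alpha[2] equals phi_2; alpha[3], alpha[4], ... vanish
```

The recursion divides by the running error variance, so it assumes that no $|\phi_{hh}|$ equals 1.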
Example 3.2.7 The PACF of an MA(1) Process

For the MA(1) process, it can be shown from (3.2.14) (see Problem 3.12) that the PACF
at lag $h$ is

$$\alpha(h) = \phi_{hh} = -(-\theta)^h / (1 + \theta^2 + \cdots + \theta^{2h}).$$

□

The Sample PACF of an AR(p) Series. If $\{X_t\}$ is an AR(p) series, then the sample
PACF based on observations $\{x_1, \ldots, x_n\}$ should reflect (with sampling variation) the
properties of the PACF itself. In particular, if the sample PACF $\hat\alpha(h)$ is significantly
different from zero for $0 \le h \le p$ and negligible for $h > p$, Example 3.2.6 suggests
that an AR(p) model might provide a good representation of the data. To decide what
is meant by "negligible" we can use the result that for an AR(p) process the sample
PACF values at lags greater than $p$ are approximately independent $N(0, 1/n)$ random
variables. This means that roughly 95 % of the sample PACF values beyond lag $p$
should fall within the bounds $\pm 1.96/\sqrt{n}$. If we observe a sample PACF satisfying
$|\hat\alpha(h)| > 1.96/\sqrt{n}$ for $0 \le h \le p$ and $|\hat\alpha(h)| < 1.96/\sqrt{n}$ for $h > p$, this suggests
an AR(p) model for the data. For a more systematic approach to model selection, see
Section 5.5.


3.2.4 Examples

Example 3.2.8

Figure 3-5. Time series of the overshorts in Example 3.2.8

The time series plotted in Figure 3-5 consists of 57 consecutive daily overshorts from
an underground gasoline tank at a filling station in Colorado. If $y_t$ is the measured
amount of fuel in the tank at the end of the $t$th day and $a_t$ is the measured amount sold
minus the amount delivered during the course of the $t$th day, then the overshort at the
end of day $t$ is defined as $x_t = y_t - y_{t-1} + a_t$. Due to the error in measuring the current
amount of fuel in the tank, the amount sold, and the amount delivered to the station, we
view $y_t$, $a_t$, and $x_t$ as observed values from some set of random variables $Y_t$, $A_t$, and $X_t$
for $t = 1, \ldots, 57$. (In the absence of any measurement error and any leak in the tank,
each $x_t$ would be zero.) The data and their ACF are plotted in Figures 3-5 and 3-6. To
check the plausibility of an MA(1) model, the bounds $\pm 1.96(1 + 2\hat\rho^2(1))^{1/2}/n^{1/2}$ are
also plotted in Figure 3-6. Since $\hat\rho(h)$ is well within these bounds for $h > 1$, the data


appear to be compatible with the model

$$X_t = \mu + Z_t + \theta Z_{t-1}, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2). \tag{3.2.16}$$

Figure 3-6. The sample ACF of the data in Figure 3-5 showing the bounds $\pm 1.96\,n^{-1/2}(1 + 2\hat\rho^2(1))^{1/2}$ assuming an MA(1) model for the data

The mean $\mu$ may be estimated by the sample mean $\bar{x}_{57} = -4.035$, and the parameters
$\theta$, $\sigma^2$ may be estimated by equating the sample ACVF with the model ACVF at lags
0 and 1, and solving the resulting equations for $\theta$ and $\sigma^2$. This estimation procedure
is known as the method of moments, and in this case gives the equations

$$(1 + \hat\theta^2)\hat\sigma^2 = \hat\gamma(0) = 3415.72,$$

$$\hat\theta\hat\sigma^2 = \hat\gamma(1) = -1719.95.$$

These equations have no exact real solution, since $|\hat\rho(1)| = 1719.95/3415.72 \approx 0.504$
slightly exceeds 0.5, the largest lag-one autocorrelation attainable by an MA(1) process.
Using the approximate solution $\hat\theta = -1$ and $\hat\sigma^2 = 1708$, we obtain the noninvertible
MA(1) model

$$X_t = -4.035 + Z_t - Z_{t-1}, \quad \{Z_t\} \sim \mathrm{WN}(0, 1708).$$
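The moment equations above can be solved mechanically. The sketch below is one possible way to do so, under an assumption of ours: when $|\hat\rho(1)| > 0.5$ it falls back to the boundary value $\hat\theta = \pm 1$ and $\hat\sigma^2 = \hat\gamma(0)/2$, which reproduces the approximate solution quoted in the text. Otherwise it returns the invertible root of $\hat\rho(1)\theta^2 - \theta + \hat\rho(1) = 0$. The function name is hypothetical.

```python
import math


def ma1_moments(gamma0, gamma1):
    """Method-of-moments estimates (theta, sigma2) for an MA(1) process
    from the sample ACVF at lags 0 and 1."""
    rho1 = gamma1 / gamma0
    if abs(rho1) > 0.5:
        # No real root: use the boundary (noninvertible) value, an assumption here
        theta = math.copysign(1.0, rho1)
        return theta, gamma0 / 2.0
    if rho1 == 0.0:
        return 0.0, gamma0
    # Invertible root of rho1 * theta^2 - theta + rho1 = 0
    theta = (1.0 - math.sqrt(1.0 - 4.0 * rho1 ** 2)) / (2.0 * rho1)
    return theta, gamma0 / (1.0 + theta ** 2)


# Overshort data: sample ACVF values quoted in the text
theta_hat, sigma2_hat = ma1_moments(3415.72, -1719.95)
```

For the overshort values this returns $\hat\theta = -1$ and $\hat\sigma^2 = 3415.72/2 = 1707.86$, which rounds to the 1708 used in the fitted model.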

Typically, in time series modeling we have little or no knowledge of the underlying
physical mechanism generating the data, and the choice of a suitable class of models
is entirely data driven. For the time series of overshorts, the data, through the graph
of the ACF, lead us to the MA(1) model. Alternatively, we can attempt to model
the mechanism generating the time series of overshorts using a structural model. As
we will see, the structural model formulation leads us again to the MA(1) model. In
the structural model setup, write $Y_t$, the observed amount of fuel in the tank at time $t$, as

$$Y_t = y_t^* + U_t, \tag{3.2.17}$$

where $y_t^*$ is the true (or actual) amount of fuel in the tank at time $t$ (not to be confused
with $y_t$ above) and $U_t$ is the resulting measurement error. The variable $y_t^*$ is an idealized
quantity that in principle cannot be observed even with the most sophisticated
measurement devices. Similarly, we assume that

$$A_t = a_t^* + V_t, \tag{3.2.18}$$

where $a_t^*$ is the actual amount of fuel sold minus the actual amount delivered during
day $t$, and $V_t$ is the associated measurement error. We further assume that
$\{U_t\} \sim \mathrm{WN}(0, \sigma_U^2)$, $\{V_t\} \sim \mathrm{WN}(0, \sigma_V^2)$, and that the two sequences $\{U_t\}$ and $\{V_t\}$ are
uncorrelated with one another ($E(U_t V_s) = 0$ for all $s$ and $t$). If the change of level per day
due to leakage is $\mu$ gallons ($\mu < 0$ indicates leakage), then

$$y_t^* = \mu + y_{t-1}^* - a_t^*. \tag{3.2.19}$$

This equation relates the actual amounts of fuel in the tank at the end of days $t$ and
$t - 1$, adjusted for the actual amounts that have been sold and delivered during the day.
Using (3.2.17)-(3.2.19), the model for the time series of overshorts is given by

$$X_t = Y_t - Y_{t-1} + A_t = \mu + U_t - U_{t-1} + V_t.$$

This model is stationary and 1-correlated, since

$$EX_t = E(\mu + U_t - U_{t-1} + V_t) = \mu$$

and

$$\gamma(h) = E[(X_{t+h} - \mu)(X_t - \mu)]
= E[(U_{t+h} - U_{t+h-1} + V_{t+h})(U_t - U_{t-1} + V_t)]
= \begin{cases} 2\sigma_U^2 + \sigma_V^2, & \text{if } h = 0, \\ -\sigma_U^2, & \text{if } |h| = 1, \\ 0, & \text{otherwise.} \end{cases}$$

It follows from Proposition 2.1.1 that $\{X_t\}$ is the MA(1) model (3.2.16) with

$$\rho(1) = \frac{\theta_1}{1 + \theta_1^2} = \frac{-\sigma_U^2}{2\sigma_U^2 + \sigma_V^2}.$$

From this equation we see that the measurement error associated with the adjustment
$\{A_t\}$ is zero (i.e., $\sigma_V^2 = 0$) if and only if $\rho(1) = -0.5$ or, equivalently, if and only
if $\theta_1 = -1$. From the analysis above, the moment estimator of $\theta_1$ for the overshort
data is in fact $-1$, so that we conclude that there is relatively little measurement error
associated with the amount of fuel sold and delivered.

We shall return to a more general discussion of structural models in Chapter 8.

□

Example 3.2.9 The Sunspot Numbers

Figure 3-7 shows the sample PACF of the sunspot numbers $S_1, \ldots, S_{100}$ (for the years
1770-1869) as obtained from ITSM by opening the project SUNSPOTS.TSM and
clicking on the second yellow button at the top of the screen. The graph also shows the
bounds $\pm 1.96/\sqrt{100}$. The fact that all of the PACF values beyond lag 2 fall within
the bounds suggests the possible suitability of an AR(2) model for the mean-corrected
data set $X_t = S_t - 46.93$. One simple way to estimate the parameters $\phi_1$, $\phi_2$, and $\sigma^2$
of such a model is to require that the ACVF of the model at lags 0, 1, and 2 should
match the sample ACVF at those lags. Substituting the sample ACVF values

$$\hat\gamma(0) = 1382.2, \quad \hat\gamma(1) = 1114.4, \quad \hat\gamma(2) = 591.73,$$

for $\gamma(0)$, $\gamma(1)$, and $\gamma(2)$ in the first three equations of (3.2.5) and (3.2.6) and solving
for $\phi_1$, $\phi_2$, and $\sigma^2$ gives the fitted model

$$X_t - 1.318X_{t-1} + 0.634X_{t-2} = Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, 289.2). \tag{3.2.20}$$

(This method of model fitting is called Yule-Walker estimation and will be discussed
more fully in Section 5.1.1.)

□ 
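The fitted sunspot model can be reproduced with a few lines of arithmetic. The sketch below assumes that, for an AR(2) model, the first three equations of (3.2.5) and (3.2.6) reduce to $\gamma(0) - \phi_1\gamma(1) - \phi_2\gamma(2) = \sigma^2$, $\gamma(1) - \phi_1\gamma(0) - \phi_2\gamma(1) = 0$, and $\gamma(2) - \phi_1\gamma(1) - \phi_2\gamma(0) = 0$, and solves them in closed form. The function name is ours.

```python
def yule_walker_ar2(g0, g1, g2):
    """Solve the three AR(2) moment equations for (phi1, phi2, sigma2)."""
    r1, r2 = g1 / g0, g2 / g0                  # sample autocorrelations
    phi1 = r1 * (1.0 - r2) / (1.0 - r1 ** 2)
    phi2 = (r2 - r1 ** 2) / (1.0 - r1 ** 2)
    sigma2 = g0 * (1.0 - phi1 * r1 - phi2 * r2)
    return phi1, phi2, sigma2


# Sample ACVF of the mean-corrected sunspot numbers, as quoted in the text
phi1, phi2, sigma2 = yule_walker_ar2(1382.2, 1114.4, 591.73)
```

Up to rounding, the result agrees with the coefficients 1.318 and $-0.634$ and the white noise variance 289.2 in (3.2.20).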


3.3 Forecasting ARMA Processes


The innovations algorithm (see Section 2.5.4) provided us with a recursive method for
forecasting second-order zero-mean processes that are not necessarily stationary. For
the causal ARMA process

$$\phi(B)X_t = \theta(B)Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2),$$

it is possible to simplify the application of the algorithm drastically. The idea is to apply
it not to the process $\{X_t\}$ itself, but to the transformed process [cf. Ansley (1979)]

$$\begin{cases} W_t = \sigma^{-1}X_t, & t = 1, \ldots, m, \\ W_t = \sigma^{-1}\phi(B)X_t, & t > m, \end{cases} \tag{3.3.1}$$

where

$$m = \max(p, q). \tag{3.3.2}$$

For notational convenience we define $\theta_0 := 1$ and $\theta_j := 0$ for $j > q$. We shall also
assume that $p \ge 1$ and $q \ge 1$. (There is no loss of generality in these assumptions,
since in the analysis that follows we may take any of the coefficients $\phi_i$ and $\theta_i$ to be
zero.)

The autocovariance function $\gamma_X(\cdot)$ of $\{X_t\}$ can easily be computed using any of the
methods described in Section 3.2.1. The autocovariances $\kappa(i, j) = E(W_iW_j)$, $i, j \ge 1$,
are then found from

$$\kappa(i, j) = \begin{cases}
\sigma^{-2}\gamma_X(i - j), & 1 \le i, j \le m, \\
\sigma^{-2}\left[\gamma_X(i - j) - \sum_{r=1}^{p}\phi_r\gamma_X(r - |i - j|)\right], & \min(i, j) \le m < \max(i, j) \le 2m, \\
\sum_{r=0}^{q}\theta_r\theta_{r+|i-j|}, & \min(i, j) > m, \\
0, & \text{otherwise.}
\end{cases} \tag{3.3.3}$$
Applying the innovations algorithm to the process $\{W_t\}$ we obtain

$$\begin{cases}
\hat{W}_{n+1} = \sum_{j=1}^{n}\theta_{nj}(W_{n+1-j} - \hat{W}_{n+1-j}), & 1 \le n < m, \\
\hat{W}_{n+1} = \sum_{j=1}^{q}\theta_{nj}(W_{n+1-j} - \hat{W}_{n+1-j}), & n \ge m,
\end{cases} \tag{3.3.4}$$

where the coefficients $\theta_{nj}$ and the mean squared errors $r_n = E(W_{n+1} - \hat{W}_{n+1})^2$ are
found recursively from the innovations algorithm with $\kappa$ defined as in (3.3.3). The
notable feature of the predictors (3.3.4) is the vanishing of $\theta_{nj}$ when both $n \ge m$ and
$j > q$. This is a consequence of the innovations algorithm and the fact that $\kappa(r, s) = 0$
if $r > m$ and $|r - s| > q$.

Observe now that the equations (3.3.1) allow each $X_n$, $n \ge 1$, to be written as a
linear combination of $W_j$, $1 \le j \le n$, and, conversely, each $W_n$, $n \ge 1$, to be written as
a linear combination of $X_j$, $1 \le j \le n$. This means that the best linear predictor of any
random variable $Y$ in terms of $\{1, X_1, \ldots, X_n\}$ is the same as the best linear predictor
of $Y$ in terms of $\{1, W_1, \ldots, W_n\}$. We shall denote this predictor by $P_nY$. In particular,
the one-step predictors of $W_{n+1}$ and $X_{n+1}$ are given by

$$\hat{W}_{n+1} = P_nW_{n+1}$$

and

$$\hat{X}_{n+1} = P_nX_{n+1}.$$

Using the linearity of $P_n$ and the equations (3.3.1) we see that

$$\begin{cases}
\hat{W}_t = \sigma^{-1}\hat{X}_t, & t = 1, \ldots, m, \\
\hat{W}_t = \sigma^{-1}\left[\hat{X}_t - \phi_1X_{t-1} - \cdots - \phi_pX_{t-p}\right], & t > m,
\end{cases} \tag{3.3.5}$$

which, together with (3.3.1), shows that

$$X_t - \hat{X}_t = \sigma\left[W_t - \hat{W}_t\right] \quad \text{for all } t \ge 1. \tag{3.3.6}$$

Replacing $(W_j - \hat{W}_j)$ by $\sigma^{-1}(X_j - \hat{X}_j)$ in (3.3.4) and then substituting into (3.3.5),
we finally obtain

$$\hat{X}_{n+1} = \begin{cases}
\sum_{j=1}^{n}\theta_{nj}(X_{n+1-j} - \hat{X}_{n+1-j}), & 1 \le n < m, \\
\phi_1X_n + \cdots + \phi_pX_{n+1-p} + \sum_{j=1}^{q}\theta_{nj}(X_{n+1-j} - \hat{X}_{n+1-j}), & n \ge m,
\end{cases} \tag{3.3.7}$$

and

$$E(X_{n+1} - \hat{X}_{n+1})^2 = \sigma^2E(W_{n+1} - \hat{W}_{n+1})^2 = \sigma^2r_n, \tag{3.3.8}$$

where $\theta_{nj}$ and $r_n$ are found from the innovations algorithm with $\kappa$ as in (3.3.3).
Equations (3.3.7) determine the one-step predictors $\hat{X}_2, \hat{X}_3, \ldots$ recursively.


Remark 1. It can be shown (see Brockwell and Davis (1991), Problem 5.6) that if
$\{X_t\}$ is invertible, then as $n \to \infty$,

$$E(X_n - \hat{X}_n - Z_n)^2 \to 0,$$

$$\theta_{nj} \to \theta_j, \quad j = 1, \ldots, q,$$

and

$$r_n \to 1.$$

Algebraic calculation of the coefficients $\theta_{nj}$ and $r_n$ is not feasible except for very simple
models, such as those considered in the following examples. However, numerical
implementation of the recursions is quite straightforward and is used to compute
predictors in the program ITSM. □

Example 3.3.1 Prediction of an AR(p) Process

Applying (3.3.7) to the ARMA(p, 0) process, we see at once that

$$\hat{X}_{n+1} = \phi_1X_n + \cdots + \phi_pX_{n+1-p}, \quad n \ge p.$$

□

Example 3.3.2 Prediction of an MA(q) Process

Applying (3.3.7) to the ARMA(1, q) process with $\phi_1 = 0$ gives

$$\hat{X}_{n+1} = \sum_{j=1}^{\min(n,q)}\theta_{nj}(X_{n+1-j} - \hat{X}_{n+1-j}), \quad n \ge 1,$$

where the coefficients $\theta_{nj}$ are found by applying the innovations algorithm to the
covariances $\kappa(i, j)$ defined in (3.3.3). Since in this case the processes $\{W_t\}$ and $\{\sigma^{-1}X_t\}$
are identical, these covariances are simply

$$\kappa(i, j) = \sigma^{-2}\gamma_X(i - j) = \sum_{r=0}^{q-|i-j|}\theta_r\theta_{r+|i-j|}.$$

□

Example 3.3.3 Prediction of an ARMA(1,1) Process

If

$$X_t - \phi X_{t-1} = Z_t + \theta Z_{t-1}, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2),$$

and $|\phi| < 1$, then equations (3.3.7) reduce to the single equation

$$\hat{X}_{n+1} = \phi X_n + \theta_{n1}(X_n - \hat{X}_n), \quad n \ge 1.$$

To compute $\theta_{n1}$ we use Example 3.2.1 to obtain $\gamma_X(0) = \sigma^2(1 + 2\theta\phi + \theta^2)/(1 - \phi^2)$.
Substituting in (3.3.3) then gives, for $i, j \ge 1$,

$$\kappa(i, j) = \begin{cases}
(1 + 2\theta\phi + \theta^2)/(1 - \phi^2), & i = j = 1, \\
1 + \theta^2, & i = j \ge 2, \\
\theta, & |i - j| = 1,\ i \ge 1, \\
0, & \text{otherwise.}
\end{cases}$$

With these values of $\kappa(i, j)$, the recursions of the innovations algorithm reduce to

$$r_0 = (1 + 2\theta\phi + \theta^2)/(1 - \phi^2),$$

$$\theta_{n1} = \theta/r_{n-1}, \tag{3.3.9}$$

$$r_n = 1 + \theta^2 - \theta^2/r_{n-1},$$

which can be solved quite explicitly (see Problem 3.13).

□ 
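The recursions (3.3.9) are simple enough to iterate directly. The following sketch does so for illustrative parameter values of our choosing ($\phi = 0.5$, $\theta = 0.4$); the function name is hypothetical.

```python
def arma11_recursion(phi, theta, n_max):
    """Iterate (3.3.9): returns (theta_n1, r) with r[n] for n = 0..n_max and
    theta_n1[n] for n = 1..n_max (theta_n1[0] is an unused placeholder)."""
    r = [(1.0 + 2.0 * theta * phi + theta ** 2) / (1.0 - phi ** 2)]
    theta_n1 = [None]
    for n in range(1, n_max + 1):
        theta_n1.append(theta / r[n - 1])
        r.append(1.0 + theta ** 2 - theta ** 2 / r[n - 1])
    return theta_n1, r


# Illustrative values: phi = 0.5, theta = 0.4
theta_n1, r = arma11_recursion(0.5, 0.4, 30)
```

For this invertible choice ($|\theta| < 1$) the iterates settle quickly, with $r_n$ approaching 1 and $\theta_{n1}$ approaching $\theta$.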
Example 3.3.4 Numerical Prediction of an ARMA(2,3) Process

In this example we illustrate the steps involved in numerical prediction of an
ARMA(2,3) process. Of course, these steps are shown for illustration only.
The calculations are all carried out automatically by ITSM in the course of computing
predictors for any specified data set and ARMA model. The process we shall consider
is the ARMA process defined by the equations

$$X_t - X_{t-1} + 0.24X_{t-2} = Z_t + 0.4Z_{t-1} + 0.2Z_{t-2} + 0.1Z_{t-3}, \tag{3.3.10}$$

where $\{Z_t\} \sim \mathrm{WN}(0, 1)$. Ten values of $X_1, \ldots, X_{10}$ simulated by the program ITSM
are shown in Table 3.1. (These were produced using the option Model>Specify
to specify the order and parameters of the model and then Model>Simulate to
generate the series from the specified model.)

The first step is to compute the covariances $\gamma_X(h)$, $h = 0, 1, 2$, which are easily
found from equations (3.2.5) with $k = 0, 1, 2$ to be

$$\gamma_X(0) = 7.17133, \quad \gamma_X(1) = 6.44139, \quad \text{and} \quad \gamma_X(2) = 5.0603.$$

From (3.3.3) we find that the symmetric matrix $K = [\kappa(i, j)]_{i,j=1,2,\ldots}$ is given by
(only the lower triangle is shown)

$$K = \begin{bmatrix}
7.1713 \\
6.4414 & 7.1713 \\
5.0603 & 6.4414 & 7.1713 \\
0.10 & 0.34 & 0.816 & 1.21 \\
0 & 0.10 & 0.34 & 0.50 & 1.21 \\
0 & 0 & 0.10 & 0.24 & 0.50 & 1.21 \\
\cdot & 0 & 0 & 0.10 & 0.24 & 0.50 & 1.21 \\
\vdots & & \ddots & \ddots & \ddots & \ddots & \ddots & \ddots
\end{bmatrix}.$$

The next step is to solve the recursions of the innovations algorithm for $\theta_{nj}$ and $r_n$
using these values for $\kappa(i, j)$. Then


Table 3.1  $\hat{X}_{n+1}$ for the ARMA(2,3) Process of Example 3.3.4

  n   X_{n+1}    r_n      θ_{n1}   θ_{n2}   θ_{n3}   \hat X_{n+1}
  0    1.704    7.1713                                 0
  1    0.527    1.3856   0.8982                        1.5305
  2    1.041    1.0057   1.3685   0.7056              -0.1710
  3    0.942    1.0019   0.4008   0.1806   0.0139      1.2428
  4    0.555    1.0019   0.3998   0.2020   0.0732      0.7443
  5   -1.002    1.0005   0.3992   0.1995   0.0994      0.3138
  6   -0.585    1.0000   0.4000   0.1997   0.0998     -1.7293
  7    0.010    1.0000   0.4000   0.2000   0.0998     -0.1688
  8   -0.638    1.0000   0.4000   0.2000   0.0999      0.3193
  9    0.525    1.0000   0.4000   0.2000   0.1000     -0.8731
 10             1.0000   0.4000   0.2000   0.1000      1.0638
 11             1.0000   0.4000   0.2000   0.1000
 12             1.0000   0.4000   0.2000   0.1000
$$\hat{X}_{n+1} = \begin{cases}
\sum_{j=1}^{n}\theta_{nj}(X_{n+1-j} - \hat{X}_{n+1-j}), & n = 1, 2, \\
X_n - 0.24X_{n-1} + \sum_{j=1}^{3}\theta_{nj}(X_{n+1-j} - \hat{X}_{n+1-j}), & n = 3, 4, \ldots,
\end{cases}$$

and

$$E(X_{n+1} - \hat{X}_{n+1})^2 = \sigma^2r_n = r_n.$$

The results are shown in Table 3.1.


□ 
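The steps of Example 3.3.4 can be sketched in code as follows. This is an illustration under stated assumptions, not the ITSM implementation. The ACVF is obtained from a truncated sum $\sigma^2\sum_j\psi_j\psi_{j+h}$ rather than from (3.2.5), the innovations recursions are written in the standard form of Section 2.5.4, the series is taken to have mean zero, and all function names are ours. The input data are the rounded values of Table 3.1, so agreement with the table is only to about three decimals.

```python
def acvf_from_psi(phi, theta, sigma2, max_lag, n_terms=500):
    """ACVF via the truncated sum sigma^2 * sum_j psi_j psi_{j+h} (causal models)."""
    th = [1.0] + list(theta)
    psi = []
    for j in range(n_terms):
        t = th[j] if j < len(th) else 0.0
        psi.append(t + sum(phi[k - 1] * psi[j - k]
                           for k in range(1, min(j, len(phi)) + 1)))
    return [sigma2 * sum(psi[j] * psi[j + h] for j in range(n_terms - h))
            for h in range(max_lag + 1)]


def kappa(i, j, phi, theta, gamma, sigma2):
    """Covariance E(W_i W_j) of the transformed process, following (3.3.3)."""
    p, q = len(phi), len(theta)
    m, d = max(p, q), abs(i - j)
    lo, hi = min(i, j), max(i, j)
    th = [1.0] + list(theta)
    if hi <= m:
        return gamma[d] / sigma2
    if lo <= m:
        if hi > 2 * m:
            return 0.0
        return (gamma[d] - sum(phi[r - 1] * gamma[abs(r - d)]
                               for r in range(1, p + 1))) / sigma2
    return sum(th[r] * th[r + d] for r in range(q + 1 - d)) if d <= q else 0.0


def innovations(kap, n_max):
    """theta[n][j] (j = 1..n) and r[n], n = 0..n_max, from covariances kap(i, j)."""
    r = [kap(1, 1)]
    theta = [[0.0]]                               # theta[0] is a placeholder
    for n in range(1, n_max + 1):
        th_n = [0.0] * (n + 1)
        for k in range(n):
            s = sum(theta[k][k - j] * th_n[n - j] * r[j] for j in range(k))
            th_n[n - k] = (kap(n + 1, k + 1) - s) / r[k]
        r.append(kap(n + 1, n + 1) - sum(th_n[n - j] ** 2 * r[j] for j in range(n)))
        theta.append(th_n)
    return theta, r


def one_step_predictors(x, phi, theta_nj, m, q):
    """xhat[n] is the predictor of X_{n+1} from (3.3.7); xhat[0] = 0."""
    xhat = [0.0]
    for n in range(1, len(x) + 1):
        jmax = n if n < m else q
        val = sum(theta_nj[n][j] * (x[n - j] - xhat[n - j]) for j in range(1, jmax + 1))
        if n >= m:
            val += sum(phi[i - 1] * x[n - i] for i in range(1, len(phi) + 1))
        xhat.append(val)
    return xhat


phi, theta, sigma2 = [1.0, -0.24], [0.4, 0.2, 0.1], 1.0
m, q = 3, 3
gamma = acvf_from_psi(phi, theta, sigma2, 2 * m)
x = [1.704, 0.527, 1.041, 0.942, 0.555, -1.002, -0.585, 0.010, -0.638, 0.525]
theta_nj, r = innovations(lambda i, j: kappa(i, j, phi, theta, gamma, sigma2), len(x))
xhat = one_step_predictors(x, phi, theta_nj, m, q)   # xhat[10] predicts X_11
```

Here `r[n]` and `theta_nj[n][j]` correspond to the columns $r_n$ and $\theta_{nj}$ of Table 3.1, and `xhat[n]` to the column $\hat{X}_{n+1}$.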


3.3.1 h-Step Prediction of an ARMA(p, q) Process


As in Section 2.5, we use $P_nY$ to denote the best linear predictor of $Y$ in terms of
$X_1, \ldots, X_n$ (which, as pointed out after (3.3.4), is the same as the best linear predictor
of $Y$ in terms of $W_1, \ldots, W_n$). Then from (2.5.30) we have

$$P_nW_{n+h} = \sum_{j=h}^{n+h-1}\theta_{n+h-1,j}(W_{n+h-j} - \hat{W}_{n+h-j})
= \sigma^{-1}\sum_{j=h}^{n+h-1}\theta_{n+h-1,j}(X_{n+h-j} - \hat{X}_{n+h-j}).$$

Using this result and applying the operator $P_n$ to each side of equation (3.3.1), we
conclude that the h-step predictors $P_nX_{n+h}$ satisfy

$$P_nX_{n+h} = \begin{cases}
\sum_{j=h}^{n+h-1}\theta_{n+h-1,j}(X_{n+h-j} - \hat{X}_{n+h-j}), & 1 \le h \le m - n, \\
\sum_{i=1}^{p}\phi_iP_nX_{n+h-i} + \sum_{j=h}^{n+h-1}\theta_{n+h-1,j}(X_{n+h-j} - \hat{X}_{n+h-j}), & h > m - n.
\end{cases} \tag{3.3.11}$$

If, as is almost always the case, $n > m = \max(p, q)$, then for all $h \ge 1$,

$$P_nX_{n+h} = \sum_{i=1}^{p}\phi_iP_nX_{n+h-i} + \sum_{j=h}^{q}\theta_{n+h-1,j}(X_{n+h-j} - \hat{X}_{n+h-j}). \tag{3.3.12}$$

Once the predictors $\hat{X}_1, \ldots, \hat{X}_n$ have been computed from (3.3.7), it is a straightforward
calculation, with $n$ fixed, to determine the predictors $P_nX_{n+1}, P_nX_{n+2}, P_nX_{n+3}, \ldots$
recursively from (3.3.12) (or (3.3.11) if $n \le m$). The calculations are performed
automatically in the Forecasting>ARMA option of the program ITSM.


Example 3.3.5 h-Step Prediction of an ARMA(2,3) Process

To compute h-step predictors, $h = 1, \ldots, 10$, for the data of Example 3.3.4 and
the model (3.3.10), open the project E334.TSM in ITSM and enter the model using the
option Model>Specify. Then select Forecasting>ARMA and specify 10 for the
number of forecasts required. You will notice that the white noise variance is automatically
set by ITSM to an estimate based on the sample. To retain the model
value of 1, you must reset the white noise variance to this value. Then click OK and
you will see a graph of the original series with the ten predicted values appended.
If you right-click on the graph and select Info, you will see the numerical results
shown in the following table as well as prediction bounds based on the assumption
that the series is Gaussian. (Prediction bounds are discussed in the last paragraph of
this chapter.) The mean squared errors are calculated as described below. Notice how
the predictors converge fairly rapidly to the mean of the process (i.e., zero) as the lead
time $h$ increases. Correspondingly, the h-step mean squared error increases from
the white noise variance (i.e., 1) at $h = 1$ to the variance of $X_t$ (i.e., 7.1713), which is
virtually reached at $h = 10$.

□ 


The Mean Squared Error of $P_nX_{n+h}$

The mean squared error of $P_nX_{n+h}$ is easily computed by ITSM from the formula

$$\sigma_n^2(h) := E(X_{n+h} - P_nX_{n+h})^2 = \sum_{j=0}^{h-1}\left(\sum_{r=0}^{j}\chi_r\theta_{n+h-r-1,j-r}\right)^2v_{n+h-j-1}, \tag{3.3.13}$$

where the coefficients $\chi_j$ are computed recursively from the equations $\chi_0 = 1$ and

$$\chi_j = \sum_{k=1}^{\min(p,j)}\phi_k\chi_{j-k}, \quad j = 1, 2, \ldots. \tag{3.3.14}$$


Example 3.3.6 h-Step Prediction of an ARMA(2,3) Process

We now illustrate the use of (3.3.12) and (3.3.13) for the h-step predictors and their
mean squared errors by manually reproducing the output of ITSM shown in Table 3.2.
From (3.3.12) and Table 3.1 we obtain

$$P_{10}X_{12} = \sum_{i=1}^{2}\phi_iP_{10}X_{12-i} + \sum_{j=2}^{3}\theta_{11,j}(X_{12-j} - \hat{X}_{12-j})$$

$$= \phi_1\hat{X}_{11} + \phi_2X_{10} + 0.2(X_{10} - \hat{X}_{10}) + 0.1(X_9 - \hat{X}_9)$$

$$= 1.1217$$

and

$$P_{10}X_{13} = \sum_{i=1}^{2}\phi_iP_{10}X_{13-i} + \sum_{j=3}^{3}\theta_{12,j}(X_{13-j} - \hat{X}_{13-j})$$

$$= \phi_1P_{10}X_{12} + \phi_2\hat{X}_{11} + 0.1(X_{10} - \hat{X}_{10})$$

$$= 1.0062.$$

Table 3.2  h-step predictors for the ARMA(2,3) Series of Example 3.3.4

  h   P_{10}X_{10+h}   sqrt(MSE)
  1       1.0638         1.0000
  2       1.1217         1.7205
  3       1.0062         2.1931
  4       0.7370         2.4643
  5       0.4955         2.5902
  6       0.3186         2.6434
  7       0.1997         2.6648
  8       0.1232         2.6730
  9       0.0753         2.6761
 10       0.0457         2.6773

For $k > 13$, $P_{10}X_k$ is easily found recursively from

$$P_{10}X_k = \phi_1P_{10}X_{k-1} + \phi_2P_{10}X_{k-2}.$$

To find the mean squared errors we use (3.3.13) with $\chi_0 = 1$, $\chi_1 = \phi_1 = 1$, and
$\chi_2 = \phi_1\chi_1 + \phi_2 = 0.76$. Using the values of $\theta_{nj}$ and $v_j (= r_j)$ in Table 3.1, we obtain

$$\sigma_{10}^2(2) = E(X_{12} - P_{10}X_{12})^2 = 2.960$$

and

$$\sigma_{10}^2(3) = E(X_{13} - P_{10}X_{13})^2 = 4.810,$$

in accordance with the results shown in Table 3.2.

□ 
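The hand calculation of Example 3.3.6 can be mirrored in a short sketch. It relies on a simplifying assumption: the coefficients $\theta_{n+h-1,j}$ and the errors $v_n$ are replaced by their limiting values $\theta_j$ and $\sigma^2$, which Table 3.1 shows to be accurate to four decimals for $n \ge 10$. The function names are ours, and only the last three observations and one-step predictors from Table 3.1 are needed as input.

```python
def h_step_forecasts(x, xhat, phi, theta, n_steps):
    """P_n X_{n+h}, h = 1..n_steps, via (3.3.12) with theta_{n+h-1,j} taken as theta_j.
    x and xhat hold X_k and its one-step predictor for the most recent k (oldest first);
    at least max(p, q) values are required."""
    n, p, q = len(x), len(phi), len(theta)
    resid = [xi - hi for xi, hi in zip(x, xhat)]      # X_k - Xhat_k
    path = list(x)                                    # observations, then forecasts
    for h in range(1, n_steps + 1):
        ar = sum(phi[i - 1] * path[n + h - 1 - i] for i in range(1, p + 1))
        ma = sum(theta[j - 1] * resid[n + h - 1 - j] for j in range(h, q + 1))
        path.append(ar + ma)
    return path[n:]


def h_step_mse(phi, theta, sigma2, n_steps):
    """Approximate sigma_n^2(h), h = 1..n_steps, from (3.3.13) and (3.3.14) with
    theta_{n,j} taken as theta_j and v_n taken as sigma2."""
    p = len(phi)
    th = [1.0] + list(theta)
    chi = [1.0]
    for j in range(1, n_steps):
        chi.append(sum(phi[k - 1] * chi[j - k] for k in range(1, min(p, j) + 1)))
    mse, total = [], 0.0
    for j in range(n_steps):
        c = sum(chi[r] * (th[j - r] if j - r < len(th) else 0.0) for r in range(j + 1))
        total += c * c * sigma2
        mse.append(total)
    return mse


phi, theta = [1.0, -0.24], [0.4, 0.2, 0.1]
x_recent = [0.010, -0.638, 0.525]            # X_8, X_9, X_10 from Table 3.1
xhat_recent = [-0.1688, 0.3193, -0.8731]     # one-step predictors of X_8, X_9, X_10
forecasts = h_step_forecasts(x_recent, xhat_recent, phi, theta, 5)
mse = h_step_mse(phi, theta, 1.0, 3)
```

The first entries of `forecasts` can be compared with the column $P_{10}X_{10+h}$ of Table 3.2, and `mse[1]`, `mse[2]` with the values 2.960 and 4.810 obtained above.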


Large-Sample Approximations

Assuming as usual that the ARMA(p, q) process defined by $\phi(B)X_t = \theta(B)Z_t$, $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$,
is causal and invertible, we have the representations

$$X_{n+h} = \sum_{j=0}^{\infty}\psi_jZ_{n+h-j} \tag{3.3.15}$$

and

$$Z_{n+h} = X_{n+h} + \sum_{j=1}^{\infty}\pi_jX_{n+h-j}, \tag{3.3.16}$$

where $\{\psi_j\}$ and $\{\pi_j\}$ are uniquely determined by equations (3.1.7) and (3.1.8), respectively.
Let $\tilde{P}_nY$ denote the best (i.e., minimum mean squared error) approximation to
$Y$ that is a linear combination or limit of linear combinations of $X_t$, $-\infty < t \le n$,
or equivalently [by (3.3.15) and (3.3.16)] of $Z_t$, $-\infty < t \le n$. The properties
of the operator $\tilde{P}_n$ were discussed in Section 2.5.6. Applying $\tilde{P}_n$ to each side of
equations (3.3.15) and (3.3.16) gives

$$\tilde{P}_nX_{n+h} = \sum_{j=h}^{\infty}\psi_jZ_{n+h-j} \tag{3.3.17}$$

and

$$\tilde{P}_nX_{n+h} = -\sum_{j=1}^{\infty}\pi_j\tilde{P}_nX_{n+h-j}. \tag{3.3.18}$$

For $h = 1$ the $j$th term on the right of (3.3.18) is just $-\pi_jX_{n+1-j}$. Once $\tilde{P}_nX_{n+1}$ has
been evaluated, $\tilde{P}_nX_{n+2}$ can then be computed from (3.3.18). The predictors $\tilde{P}_nX_{n+3}$,
$\tilde{P}_nX_{n+4}, \ldots$ can then be computed successively in the same way. Subtracting (3.3.17)
from (3.3.15) gives the h-step prediction error as

$$X_{n+h} - \tilde{P}_nX_{n+h} = \sum_{j=0}^{h-1}\psi_jZ_{n+h-j},$$

from which we see that the mean squared error is

$$\tilde\sigma^2(h) = \sigma^2\sum_{j=0}^{h-1}\psi_j^2. \tag{3.3.19}$$

The predictors obtained in this way have the form

$$\tilde{P}_nX_{n+h} = \sum_{j=0}^{\infty}c_jX_{n-j}. \tag{3.3.20}$$

In practice, of course, we have only observations $X_1, \ldots, X_n$ available, so we must
truncate the series (3.3.20) after $n$ terms. The resulting predictor is a useful approximation
to $P_nX_{n+h}$ if $n$ is large and the coefficients $c_j$ converge to zero rapidly as $j$
increases. It can be shown that the mean squared error (3.3.19) of $\tilde{P}_nX_{n+h}$ can also be
obtained by letting $n \to \infty$ in the expression (3.3.13) for the mean squared error of
$P_nX_{n+h}$, so that $\tilde\sigma^2(h)$ is an easily calculated approximation to $\sigma_n^2(h)$ for large $n$.
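As a quick illustration of (3.3.19), the sketch below computes the $\psi$-weights of the ARMA(2,3) model (3.3.10) and the resulting large-sample root mean squared errors. It assumes that the $\psi_j$ satisfy the usual recursion $\psi_j = \theta_j + \sum_{k=1}^{\min(j,p)}\phi_k\psi_{j-k}$ with $\theta_0 = 1$; the function names are ours.

```python
import math


def psi_weights(phi, theta, n):
    """First n coefficients psi_0, ..., psi_{n-1} of the causal representation."""
    th = [1.0] + list(theta)
    psi = []
    for j in range(n):
        t = th[j] if j < len(th) else 0.0
        psi.append(t + sum(phi[k - 1] * psi[j - k]
                           for k in range(1, min(j, len(phi)) + 1)))
    return psi


def large_sample_rmse(phi, theta, sigma2, n_steps):
    """Square roots of sigma^2 * sum_{j<h} psi_j^2 for h = 1..n_steps, as in (3.3.19)."""
    psi = psi_weights(phi, theta, n_steps)
    out, total = [], 0.0
    for j in range(n_steps):
        total += sigma2 * psi[j] ** 2
        out.append(math.sqrt(total))
    return out


rmse = large_sample_rmse([1.0, -0.24], [0.4, 0.2, 0.1], 1.0, 10)
```

For this model the values agree with the root mean squared errors listed in Table 3.2 to the four decimals shown there, which is consistent with the remark that $\tilde\sigma^2(h)$ approximates $\sigma_n^2(h)$ well once $n$ is moderately large.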


Prediction Bounds for Gaussian Processes

If the ARMA process $\{X_t\}$ is driven by Gaussian white noise (i.e., if $\{Z_t\} \sim \mathrm{IID}\ N(0, \sigma^2)$),
then for each $h \ge 1$ the prediction error $X_{n+h} - P_nX_{n+h}$ is normally
distributed with mean 0 and variance $\sigma_n^2(h)$ given by (3.3.13) (or, for large $n$,
approximately by (3.3.19)).

Consequently, if $\Phi_{1-\alpha/2}$ denotes the $(1 - \alpha/2)$ quantile of the standard normal distribution
function, it follows that $X_{n+h}$ lies between the bounds $P_nX_{n+h} \pm \Phi_{1-\alpha/2}\sigma_n(h)$
with probability $(1 - \alpha)$. These bounds are therefore called $(1 - \alpha)$ prediction bounds
for $X_{n+h}$.
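The bounds are straightforward to evaluate once a predictor and its root mean squared error are available. The following minimal sketch uses the standard library's normal quantile function; the function name is ours, and the numerical inputs are the one-step values from Table 3.2.

```python
from statistics import NormalDist


def prediction_bounds(forecast, rmse, alpha=0.05):
    """(1 - alpha) prediction bounds: forecast -/+ Phi_{1 - alpha/2} * rmse."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return forecast - z * rmse, forecast + z * rmse


# One-step predictor and root MSE for the ARMA(2,3) example (Table 3.2, h = 1)
lower, upper = prediction_bounds(1.0638, 1.0)
```

With `alpha = 0.05` the quantile is approximately 1.96, so the bounds are the familiar predictor plus or minus 1.96 root mean squared errors.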


Problems 


3.1 Determine which of the following ARMA processes are causal and which of
them are invertible. (In each case $\{Z_t\}$ denotes white noise.)

(a) $X_t + 0.2X_{t-1} - 0.48X_{t-2} = Z_t$.

(b) $X_t + 1.9X_{t-1} + 0.88X_{t-2} = Z_t + 0.2Z_{t-1} + 0.7Z_{t-2}$.

(c) $X_t + 0.6X_{t-1} = Z_t + 1.2Z_{t-1}$.

(d) $X_t + 1.8X_{t-1} + 0.81X_{t-2} = Z_t$.

(e) $X_t + 1.6X_{t-1} = Z_t - 0.4Z_{t-1} + 0.04Z_{t-2}$.

3.2 For those processes in Problem 3.1 that are causal, compute and graph their
ACF and PACF using the program ITSM.

3.3 For those processes in Problem 3.1 that are causal, compute the first six coefficients
$\psi_0, \psi_1, \ldots, \psi_5$ in the causal representation $X_t = \sum_{j=0}^{\infty}\psi_jZ_{t-j}$ of $\{X_t\}$.

3.4 Compute the ACF and PACF of the AR(2) process

$$X_t = 0.8X_{t-2} + Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2).$$

3.5 Let $\{Y_t\}$ be the ARMA plus noise time series defined by

$$Y_t = X_t + W_t,$$

where $\{W_t\} \sim \mathrm{WN}(0, \sigma_w^2)$, $\{X_t\}$ is the ARMA(p, q) process satisfying

$$\phi(B)X_t = \theta(B)Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2),$$

and $E(W_sZ_t) = 0$ for all $s$ and $t$.

(a) Show that $\{Y_t\}$ is stationary and find its autocovariance function in terms of
$\sigma_w^2$ and the ACVF of $\{X_t\}$.

(b) Show that the process $U_t := \phi(B)Y_t$ is r-correlated, where $r = \max(p, q)$
and hence, by Proposition 2.1.1, is an MA(r) process. Conclude that $\{Y_t\}$ is
an ARMA(p, r) process.

3.6 Show that the two MA(1) processes

$$X_t = Z_t + \theta Z_{t-1}, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$$

$$Y_t = \tilde{Z}_t + \frac{1}{\theta}\tilde{Z}_{t-1}, \quad \{\tilde{Z}_t\} \sim \mathrm{WN}(0, \sigma^2\theta^2),$$

where $0 < |\theta| < 1$, have the same autocovariance functions.

3.7 Suppose that $\{X_t\}$ is the noninvertible MA(1) process

$$X_t = Z_t + \theta Z_{t-1}, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2),$$

where $|\theta| > 1$. Define a new process $\{W_t\}$ as

$$W_t = \sum_{j=0}^{\infty}(-\theta)^{-j}X_{t-j}$$

and show that $\{W_t\} \sim \mathrm{WN}(0, \sigma_W^2)$. Express $\sigma_W^2$ in terms of $\theta$ and $\sigma^2$ and show
that $\{X_t\}$ has the invertible representation (in terms of $\{W_t\}$)

$$X_t = W_t + \frac{1}{\theta}W_{t-1}.$$

3.8  Let  {Ar}  denote  the  unique  stationary  solution  of  the  autoregressive  equations 

Xt  —  <pXt_i  +  Zt,  t  =  0,  ±1,  ... , 

where  {Zt}  ~  WN(0,  a2)  and  |0|  >  1.  Then  Ar  is  given  by  the  expression 
(2.2.11).  Define  the  new  sequence 

Wt  =  Xt-j-Xt-l9 
<p 

show  that  {Wt}  ~  WN  (0,  rr^.) .  and  express  a ^  in  terms  of  a2  and  4>.  These 
calculations  show  that  {Xj}  is  the  (unique  stationary)  solution  of  the  causal  AR 
equations 

xt  =  \xt_i  +  Wt,  t  =  0,±1,.... 

<t> 


3.9(a)  Calculate  the  autocovariance  function  y  (•)  of  the  stationary  time  series 

Yt  =  ix  +  Zr  +  0\Zf—\  +  9nZt-n,  {Z,}  ~  WN  (0,  a2). 
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(b)  Use  the  program  ITSM  to  compute  the  sample  mean  and  sample  autocovari¬ 
ances  y(h),  0  <  h  <  20,  of  {VV12XJ,  where  { Xt ,  t  =  1,  . . . ,  72}  is  the 
accidental  deaths  series  DEATHS. TSM  of  Example  1.1.3. 

(c)  By  equating  y  (1),  y  (11),  and  y  (12)  from  part  (b)  to  y  (1),  y  (11),  and  y  (12), 
respectively,  from  part  (a),  find  a  model  of  the  form  defined  in  (a)  to  represent 
{W12x,}. 

3.10  By  matching  the  autocovariances  and  sample  autocovariances  at  lags  0  and  1,  fit 
a  model  of  the  form 

Xt-n  =  <£(V-i  -  /X)  +  z„  {Zr}  ~  WN(0,  a2), 

to  the  data  STRIKES. TSM  of  Example  1.1.6.  Use  the  fitted  model  to  compute 
the  best  predictor  of  the  number  of  strikes  in  1981.  Estimate  the  mean  squared 
error  of  your  predictor  and  construct  95  %  prediction  bounds  for  the  number  of 
strikes  in  1981  assuming  that  {Zt}  ~  iid  N(0,  <j2). 

3.11  Show  that  the  value  at  lag  2  of  the  partial  ACF  of  the  MA(1)  process 

Xt  —  Zt  +  9Zt_  1,  t  —  0,  =bl,  . . . , 
where  {Zt}  ~  WN(0,  a2),  is 

or  (2)  =  -e2/(i+e2  +  e4). 

3.12 For the MA(1) process of Problem 3.11, the best linear predictor of $X_{n+1}$ based on $X_1, \ldots, X_n$ is

$$\hat X_{n+1} = \phi_{n,1} X_n + \cdots + \phi_{n,n} X_1,$$

where $\boldsymbol\phi_n = (\phi_{n1}, \ldots, \phi_{nn})'$ satisfies $R_n \boldsymbol\phi_n = \boldsymbol\rho_n$ [equation (2.5.23)]. By substituting the appropriate correlations into $R_n$ and $\boldsymbol\rho_n$ and solving the resulting equations (starting with the last and working up), show that for $1 \le j < n$,

$$\phi_{n,n-j} = (-\theta)^{-j}\left(1 + \theta^2 + \cdots + \theta^{2j}\right)\phi_{nn},$$

and hence that the PACF is

$$\alpha(n) := \phi_{nn} = -(-\theta)^n/\left(1 + \theta^2 + \cdots + \theta^{2n}\right).$$

3.13 The coefficients $\theta_{nj}$ and one-step mean squared errors $v_n = r_n\sigma^2$ for the general causal ARMA(1,1) process in Example 3.3.3 can be found as follows:

(a) Show that if $y_n := r_n/(r_n - 1)$, then the last of equations (3.3.9) can be rewritten in the form

$$y_n = \theta^{-2} y_{n-1} + 1, \quad n \ge 1.$$

(b) Deduce that $y_n = \theta^{-2n} y_0 + \sum_{j=1}^{n} \theta^{-2(j-1)}$ and hence determine $r_n$ and $\theta_{n1}$, $n = 1, 2, \ldots$.

(c) Evaluate the limits as $n \to \infty$ of $r_n$ and $\theta_{n1}$ in the two cases $|\theta| < 1$ and $|\theta| \ge 1$.


Spectral  Analysis 


4.1  Spectral  Densities 

4.2  The  Periodogram 

4.3  Time-Invariant  Linear  Filters 

4.4  The  Spectral  Density  of  an  ARMA  Process 


This chapter can be omitted without any loss of continuity. The reader with no background in Fourier or complex analysis should go straight to Chapter 5. The spectral representation of a stationary time series $\{X_t\}$ essentially decomposes $\{X_t\}$ into a sum of sinusoidal components with uncorrelated random coefficients. In conjunction with this decomposition there is a corresponding decomposition into sinusoids of the autocovariance function of $\{X_t\}$. The spectral decomposition is thus an analogue for stationary processes of the more familiar Fourier representation of deterministic functions. The analysis of stationary processes by means of their spectral representation is often referred to as the "frequency domain analysis" of time series or "spectral analysis." It is equivalent to "time domain" analysis based on the autocovariance function, but provides an alternative way of viewing the process, which for some applications may be more illuminating. For example, in the design of a structure subject to a randomly fluctuating load, it is important to be aware of the presence in the loading force of a large sinusoidal component with a particular frequency to ensure that this is not a resonant frequency of the structure. The spectral point of view is also particularly useful in the analysis of multivariate stationary processes and in the analysis of linear filters. In Section 4.1 we introduce the spectral density of a stationary process $\{X_t\}$, which specifies the frequency decomposition of the autocovariance function, and the closely related spectral representation (or frequency decomposition) of the process $\{X_t\}$ itself. Section 4.2 deals with the periodogram, a sample-based function from which we obtain estimators of the spectral density. In Section 4.3 we discuss time-invariant linear filters from a spectral point of view and in Section 4.4 we use the results to derive the spectral density of an arbitrary ARMA process.


4.1  Spectral  Densities 


Suppose that $\{X_t\}$ is a zero-mean stationary time series with autocovariance function $\gamma(\cdot)$ satisfying $\sum_{h=-\infty}^{\infty} |\gamma(h)| < \infty$. The spectral density of $\{X_t\}$ is the function $f(\cdot)$ defined by

$$f(\lambda) = \frac{1}{2\pi} \sum_{h=-\infty}^{\infty} e^{-ih\lambda}\gamma(h), \quad -\infty < \lambda < \infty, \tag{4.1.1}$$

where $e^{i\lambda} = \cos(\lambda) + i\sin(\lambda)$ and $i = \sqrt{-1}$. The summability of $|\gamma(\cdot)|$ implies that the series in (4.1.1) converges absolutely (since $|e^{ih\lambda}|^2 = \cos^2(h\lambda) + \sin^2(h\lambda) = 1$). Since cos and sin have period $2\pi$, so also does $f$, and it suffices to confine attention to the values of $f$ on the interval $(-\pi, \pi]$.
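As a rough illustration of definition (4.1.1), the following sketch evaluates the sum truncated at a finite lag. The function names, the truncation point `max_lag`, and the MA(1) autocovariance used as input are assumptions made for this example only; they are not taken from the text or from ITSM.

```python
import cmath
import math

def spectral_density(acvf, lam, max_lag):
    """Evaluate (4.1.1) with the sum truncated at |h| <= max_lag.

    acvf is a function h -> gamma(h), assumed even and absolutely
    summable; max_lag is a choice made for this sketch.
    """
    total = sum(cmath.exp(-1j * h * lam) * acvf(h)
                for h in range(-max_lag, max_lag + 1))
    # For an even ACVF the imaginary parts cancel, so keep the real part.
    return total.real / (2 * math.pi)

def ma1_acvf(h, theta=0.5, sigma2=1.0):
    """Illustrative ACVF: MA(1) with hypothetical theta = 0.5, sigma^2 = 1."""
    if h == 0:
        return sigma2 * (1 + theta ** 2)
    if abs(h) == 1:
        return sigma2 * theta
    return 0.0

f0 = spectral_density(ma1_acvf, 0.0, max_lag=5)
f_plain = spectral_density(ma1_acvf, 0.3, max_lag=5)
f_shifted = spectral_density(ma1_acvf, 0.3 + 2 * math.pi, max_lag=5)
```

Here `f_plain` and `f_shifted` agree up to rounding, which reflects the period $2\pi$ of $f$ noted above.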


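Basic Properties of $f$:

(a) $f$ is even, i.e., $f(\lambda) = f(-\lambda)$, (4.1.2)

(b) $f(\lambda) \ge 0$ for all $\lambda \in (-\pi, \pi]$, and (4.1.3)

(c) $\gamma(k) = \int_{-\pi}^{\pi} e^{ik\lambda} f(\lambda)\, d\lambda = \int_{-\pi}^{\pi} \cos(k\lambda) f(\lambda)\, d\lambda$. (4.1.4)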

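Proof Since $\sin(\cdot)$ is an odd function and $\cos(\cdot)$ and $\gamma(\cdot)$ are even functions, we have

$$f(\lambda) = \frac{1}{2\pi} \sum_{h=-\infty}^{\infty} \left(\cos(h\lambda) - i\sin(h\lambda)\right)\gamma(h) = \frac{1}{2\pi} \sum_{h=-\infty}^{\infty} \cos(-h\lambda)\gamma(h) + 0 = f(-\lambda).$$

For each positive integer $N$ define

$$f_N(\lambda) = \frac{1}{2\pi N} E\left(\left|\sum_{r=1}^{N} X_r e^{-ir\lambda}\right|^2\right) = \frac{1}{2\pi N} E\left(\sum_{r=1}^{N} X_r e^{-ir\lambda} \sum_{s=1}^{N} X_s e^{is\lambda}\right) = \frac{1}{2\pi N} \sum_{|h|<N} (N - |h|) e^{-ih\lambda}\gamma(h).$$

Clearly, the function $f_N$ is nonnegative for each $N$, and since $f_N(\lambda) \to \frac{1}{2\pi}\sum_{h=-\infty}^{\infty} e^{-ih\lambda}\gamma(h) = f(\lambda)$ as $N \to \infty$, $f$ must also be nonnegative. This proves (4.1.3). Turning to (4.1.4),

$$\int_{-\pi}^{\pi} e^{ik\lambda} f(\lambda)\, d\lambda = \int_{-\pi}^{\pi} \frac{1}{2\pi} \sum_{h=-\infty}^{\infty} e^{i(k-h)\lambda}\gamma(h)\, d\lambda = \frac{1}{2\pi} \sum_{h=-\infty}^{\infty} \gamma(h) \int_{-\pi}^{\pi} e^{i(k-h)\lambda}\, d\lambda = \gamma(k),$$

since the only nonzero summand in the second line is the one for which $h = k$ (see Problem 4.1). ■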
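Property (4.1.4) can be checked numerically. The sketch below approximates the integral by a midpoint rule and recovers the autocovariances from a spectral density. The MA(1) density used as the test case, the parameter values, and the grid size are assumptions chosen for illustration.

```python
import math

def ma1_density(lam, theta=0.5, sigma2=1.0):
    """Illustrative density: MA(1) with hypothetical theta = 0.5, sigma^2 = 1."""
    return sigma2 * (1 + 2 * theta * math.cos(lam) + theta ** 2) / (2 * math.pi)

def acvf_from_density(f, k, n_grid=2000):
    """Approximate the cosine integral in (4.1.4) by a midpoint rule on (-pi, pi]."""
    step = 2 * math.pi / n_grid
    total = 0.0
    for j in range(n_grid):
        lam = -math.pi + (j + 0.5) * step
        total += math.cos(k * lam) * f(lam)
    return total * step

# Expect approximately gamma(0) = 1 + theta^2, gamma(1) = theta, gamma(k) = 0 for k >= 2.
gammas = [acvf_from_density(ma1_density, k) for k in range(4)]
```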


Equation (4.1.4) expresses the autocovariances of a stationary time series with absolutely summable ACVF as the Fourier coefficients of the nonnegative even function on $(-\pi, \pi]$ defined by (4.1.1). However, even if $\sum_{h=-\infty}^{\infty} |\gamma(h)| = \infty$, there may exist a corresponding spectral density defined as follows.


Definition  4.1.1 


A function $f$ is the spectral density of a stationary time series $\{X_t\}$ with ACVF $\gamma(\cdot)$ if

(i) $f(\lambda) \ge 0$ for all $\lambda \in (-\pi, \pi]$, and

(ii) $\gamma(h) = \int_{-\pi}^{\pi} e^{ih\lambda} f(\lambda)\, d\lambda$ for all integers $h$.


Remark 1. Spectral densities are essentially unique. That is, if $f$ and $g$ are two spectral densities corresponding to the autocovariance function $\gamma(\cdot)$, i.e., $\gamma(h) = \int_{-\pi}^{\pi} e^{ih\lambda} f(\lambda)\, d\lambda = \int_{-\pi}^{\pi} e^{ih\lambda} g(\lambda)\, d\lambda$ for all integers $h$, then $f$ and $g$ have the same Fourier coefficients and hence are equal (see, for example, Brockwell and Davis (1991), Section 2.8). □

The  following  proposition  characterizes  spectral  densities. 

Proposition 4.1.1 A real-valued function $f$ defined on $(-\pi, \pi]$ is the spectral density of a real-valued stationary process if and only if

(i) $f(\lambda) = f(-\lambda)$,

(ii) $f(\lambda) \ge 0$, and

(iii) $\int_{-\pi}^{\pi} f(\lambda)\, d\lambda < \infty$.

Proof If $\gamma(\cdot)$ is absolutely summable, then (i)-(iii) follow from the basic properties of $f$, (4.1.2)-(4.1.4). For the argument in the general case, see Brockwell and Davis (1991), Section 4.3.

Conversely, suppose $f$ satisfies (i)-(iii). Then it is easy to check, using (i), that the function defined by

$$\gamma(h) = \int_{-\pi}^{\pi} e^{ih\lambda} f(\lambda)\, d\lambda$$

is even. Moreover, if $a_r \in \mathbb{R}$, $r = 1, \ldots, n$, then

$$\sum_{r,s=1}^{n} a_r \gamma(r-s) a_s = \int_{-\pi}^{\pi} \sum_{r,s=1}^{n} a_r a_s e^{i\lambda(r-s)} f(\lambda)\, d\lambda = \int_{-\pi}^{\pi} \left|\sum_{r=1}^{n} a_r e^{i\lambda r}\right|^2 f(\lambda)\, d\lambda \ge 0,$$

so that $\gamma(\cdot)$ is also nonnegative definite and therefore, by Theorem 2.1.1, is an autocovariance function. ■


Corollary  4.1.1 


An absolutely summable function $\gamma(\cdot)$ is the autocovariance function of a stationary time series if and only if it is even and

$$f(\lambda) = \frac{1}{2\pi} \sum_{h=-\infty}^{\infty} e^{-ih\lambda}\gamma(h) \ge 0, \quad \text{for all } \lambda \in (-\pi, \pi], \tag{4.1.5}$$

in which case $f(\cdot)$ is the spectral density of $\gamma(\cdot)$.


Proof We have already established the necessity of (4.1.5). Now suppose (4.1.5) holds. Applying Proposition 4.1.1 (the assumptions are easily checked) we conclude that $f$ is the spectral density of some autocovariance function. But this ACVF must be $\gamma(\cdot)$, since $\gamma(k) = \int_{-\pi}^{\pi} e^{ik\lambda} f(\lambda)\, d\lambda$ for all integers $k$. ■


Example  4.1.1 


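Using Corollary 4.1.1, it is a simple matter to show that the function defined by

$$\kappa(h) = \begin{cases} 1, & \text{if } h = 0, \\ \rho, & \text{if } h = \pm 1, \\ 0, & \text{otherwise}, \end{cases}$$

is the ACVF of a stationary time series if and only if $|\rho| \le \frac{1}{2}$ (see Example 2.1.1). Since $\kappa(\cdot)$ is even and nonzero only at lags $0, \pm 1$, it follows from the corollary that $\kappa$ is an ACVF if and only if the function

$$f(\lambda) = \frac{1}{2\pi} \sum_{h=-\infty}^{\infty} e^{-ih\lambda}\kappa(h) = \frac{1}{2\pi}\left[1 + 2\rho\cos\lambda\right]$$

is nonnegative for all $\lambda \in (-\pi, \pi]$. But this occurs if and only if $|\rho| \le \frac{1}{2}$.

□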

As  illustrated  in  the  previous  example,  Corollary  4.1.1  provides  us  with  a  powerful 
tool  for  checking  whether  or  not  an  absolutely  summable  function  on  the  integers  is 
an  autocovariance  function.  It  is  much  simpler  and  much  more  informative  than  direct 
verification  of  nonnegative  definiteness  as  required  in  Theorem  2.1.1. 
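The check described by Corollary 4.1.1 is easy to sketch numerically for the lag-one function of Example 4.1.1, which takes the value 1 at lag 0, $\rho$ at lags $\pm 1$, and 0 elsewhere. The grid size and tolerance below are assumptions of this sketch; a grid check is only suggestive, whereas the corollary requires nonnegativity at every frequency.

```python
import math

def kappa_density(lam, rho):
    """f(lambda) from (4.1.5) for kappa(0) = 1, kappa(+-1) = rho, 0 otherwise."""
    return (1 + 2 * rho * math.cos(lam)) / (2 * math.pi)

def is_acvf(rho, n_grid=1000, tol=1e-12):
    """Check f >= 0 on a grid of (-pi, pi]; n_grid and tol are sketch choices."""
    grid = [-math.pi + 2 * math.pi * j / n_grid for j in range(1, n_grid + 1)]
    return min(kappa_density(lam, rho) for lam in grid) >= -tol

# The grid contains (approximately) lambda = 0 and lambda = pi, where the
# minimum of 1 + 2*rho*cos(lambda) is attained for rho < 0 and rho > 0.
results = {rho: is_acvf(rho) for rho in (-0.6, -0.5, 0.0, 0.5, 0.6)}
```

The values of `results` agree with the condition that the absolute value of $\rho$ be at most 1/2.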

Not all autocovariance functions have a spectral density. For example, the stationary time series

$$X_t = A\cos(\omega t) + B\sin(\omega t), \tag{4.1.6}$$

where $A$ and $B$ are uncorrelated random variables with mean 0 and variance 1, has ACVF $\gamma(h) = \cos(\omega h)$ (Problem 2.2), which is not expressible as $\int_{-\pi}^{\pi} e^{ih\lambda} f(\lambda)\, d\lambda$, with $f$ a function on $(-\pi, \pi]$. Nevertheless, $\gamma(\cdot)$ can be written as the Fourier transform of the discrete distribution function

$$F(\lambda) = \begin{cases} 0 & \text{if } \lambda < -\omega, \\ 0.5 & \text{if } -\omega \le \lambda < \omega, \\ 1.0 & \text{if } \lambda \ge \omega, \end{cases}$$

i.e.,

$$\cos(\omega h) = \int_{(-\pi,\pi]} e^{ih\lambda}\, dF(\lambda),$$

where the integral is as defined in Section A.1. As the following theorem states (see Brockwell and Davis (1991), p. 117), every ACVF is the Fourier transform of a (generalized) distribution function on $[-\pi, \pi]$. This representation is called the spectral representation of the ACVF.

Theorem 4.1.1

(Spectral Representation of the ACVF) A function $\gamma(\cdot)$ defined on the integers is the ACVF of a stationary time series if and only if there exists a right-continuous, nondecreasing, bounded function $F$ on $[-\pi, \pi]$ with $F(-\pi) = 0$ such that

$$\gamma(h) = \int_{(-\pi,\pi]} e^{ih\lambda}\, dF(\lambda) \tag{4.1.7}$$

for all integers $h$. (For real-valued time series, $F$ is symmetric in the sense that $\int_{(a,b]} dF(x) = \int_{[-b,-a)} dF(x)$ for all $a$ and $b$ such that $0 < a < b$.)

Remark 2. The function $F$ is a generalized distribution function on $[-\pi, \pi]$ in the sense that $G(\lambda) = F(\lambda)/F(\pi)$ is a probability distribution function on $[-\pi, \pi]$. Note that since $F(\pi) = \gamma(0) = \mathrm{Var}(X_1)$, the ACF of $\{X_t\}$ has spectral representation

$$\rho(h) = \int_{(-\pi,\pi]} e^{ih\lambda}\, dG(\lambda).$$

The function $F$ in (4.1.7) is called the spectral distribution function of $\gamma(\cdot)$. If $F(\lambda)$ can be expressed as $F(\lambda) = \int_{-\pi}^{\lambda} f(y)\, dy$ for all $\lambda \in [-\pi, \pi]$, then $f$ is the spectral density function and the time series is said to have a continuous spectrum. If $F$ is a discrete distribution function (i.e., if $G$ is a discrete probability distribution function), then the time series is said to have a discrete spectrum. The time series (4.1.6) has a discrete spectrum. □


Example 4.1.2

Linear Combination of Sinusoids


Consider now the process obtained by adding uncorrelated processes of the type defined in (4.1.6), i.e.,

$$X_t = \sum_{j=1}^{k} \left(A_j\cos(\omega_j t) + B_j\sin(\omega_j t)\right), \quad 0 < \omega_1 < \cdots < \omega_k < \pi, \tag{4.1.8}$$

where $A_1, B_1, \ldots, A_k, B_k$ are uncorrelated random variables with $E(A_j) = E(B_j) = 0$ and $\mathrm{Var}(A_j) = \mathrm{Var}(B_j) = \sigma_j^2$, $j = 1, \ldots, k$. By Problem 4.5, the ACVF of this time series is $\gamma(h) = \sum_{j=1}^{k} \sigma_j^2\cos(\omega_j h)$ and its spectral distribution function is $F(\lambda) = \sum_{j=1}^{k} \sigma_j^2 F_j(\lambda)$, where

$$F_j(\lambda) = \begin{cases} 0 & \text{if } \lambda < -\omega_j, \\ 0.5 & \text{if } -\omega_j \le \lambda < \omega_j, \\ 1.0 & \text{if } \lambda \ge \omega_j. \end{cases}$$

A sample path of this time series with $k = 2$, $\omega_1 = \pi/4$, $\omega_2 = \pi/6$, $\sigma_1^2 = 9$, and $\sigma_2^2 = 1$ is plotted in Figure 4-1. Not surprisingly, the sample path closely approximates a sinusoid with frequency $\omega_1 = \pi/4$ (and period $2\pi/\omega_1 = 8$). The general features of this sample path could have been deduced from the spectral distribution function (see Figure 4-2), which places 90% of its total mass at the frequencies $\pm\pi/4$. This means that 90% of the variance of $X_t$ is contributed by the term $A_1\cos(\omega_1 t) + B_1\sin(\omega_1 t)$, which is a sinusoid with period 8.

Figure 4-1  A sample path of size 100 from the time series in Example 4.1.2

Figure 4-2  The spectral distribution function $F(\lambda)$, $-\pi \le \lambda \le \pi$, of the time series in Example 4.1.2

□
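The numbers quoted in Example 4.1.2 can be reproduced with a few lines of arithmetic. The sketch below uses the parameter values stated in the example; the function names are illustrative assumptions.

```python
import math

# Parameters quoted in Example 4.1.2.
omegas = [math.pi / 4, math.pi / 6]
sigma2 = [9.0, 1.0]

def acvf(h):
    """gamma(h) = sum_j sigma_j^2 cos(omega_j h)."""
    return sum(s2 * math.cos(w * h) for s2, w in zip(sigma2, omegas))

def spectral_distribution(lam):
    """F(lambda): component j puts mass sigma_j^2 / 2 at -omega_j and at +omega_j."""
    total = 0.0
    for s2, w in zip(sigma2, omegas):
        if lam >= w:
            total += s2
        elif lam >= -w:
            total += 0.5 * s2
    return total

total_mass = spectral_distribution(math.pi)   # equals gamma(0), the variance of X_t
share_first = sigma2[0] / total_mass          # fraction of mass at +-pi/4
period_first = 2 * math.pi / omegas[0]        # period of the dominant sinusoid
```

With these inputs, `share_first` is 0.9 and `period_first` is 8, matching the 90% figure and the period quoted in the example.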

The remarkable feature of Example 4.1.2 is that every zero-mean stationary process can be expressed as a superposition of uncorrelated sinusoids with frequencies $\omega \in [0, \pi]$. In general, however, a stationary process is a superposition of infinitely many sinusoids rather than a finite number as in (4.1.8). The required generalization of (4.1.8) that allows for this is called a stochastic integral, written as

$$X_t = \int_{(-\pi,\pi]} e^{it\lambda}\, dZ(\lambda), \tag{4.1.9}$$

where $\{Z(\lambda), -\pi < \lambda \le \pi\}$ is a complex-valued process with orthogonal (or uncorrelated) increments. The representation (4.1.9) of a zero-mean stationary process $\{X_t\}$ is called the spectral representation of the process and should be compared with the corresponding spectral representation (4.1.7) of the autocovariance function $\gamma(\cdot)$. The underlying technical aspects of stochastic integration are beyond the scope of this book; however, in the simple case of the process (4.1.8) it is not difficult to see that it can be reexpressed in the form (4.1.9) by choosing

$$dZ(\lambda) = \begin{cases} \dfrac{A_j + iB_j}{2}, & \text{if } \lambda = -\omega_j \text{ and } j \in \{1, \ldots, k\}, \\[1ex] \dfrac{A_j - iB_j}{2}, & \text{if } \lambda = \omega_j \text{ and } j \in \{1, \ldots, k\}, \\[1ex] 0, & \text{otherwise}. \end{cases}$$

For this example it is also clear that

$$E\left(dZ(\lambda)\overline{dZ(\lambda)}\right) = \begin{cases} \dfrac{\sigma_j^2}{2}, & \text{if } \lambda = \pm\omega_j, \\[1ex] 0, & \text{otherwise}. \end{cases}$$

In general, the connection between $dZ(\lambda)$ and the spectral distribution function of the process can be expressed symbolically as

$$E\left(dZ(\lambda)\overline{dZ(\lambda)}\right) = \begin{cases} F(\lambda) - F(\lambda-), & \text{for a discrete spectrum}, \\ f(\lambda)\, d\lambda, & \text{for a continuous spectrum}. \end{cases} \tag{4.1.10}$$

These relations show that a large jump in the spectral distribution function (or a large peak in the spectral density) at frequency $\pm\omega$ indicates the presence in the time series of strong sinusoidal components with frequencies at (or near) $\omega$ radians per unit time. The period of a sinusoid with frequency $\omega$ radians per unit time is $2\pi/\omega$.


Example 4.1.3

White Noise

If $\{X_t\} \sim \mathrm{WN}(0, \sigma^2)$, then $\gamma(0) = \sigma^2$ and $\gamma(h) = 0$ for all $|h| > 0$. This process has a flat spectral density (see Problem 4.2)

$$f(\lambda) = \frac{\sigma^2}{2\pi}, \quad -\pi \le \lambda \le \pi.$$

A process with this spectral density is called white noise, since each frequency in the spectrum contributes equally to the variance of the process.

□


Example 4.1.4

The Spectral Density of an AR(1) Process


If $\{X_t\}$ is a causal AR(1) process satisfying the equation

$$X_t = \phi X_{t-1} + Z_t,$$

where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$, then from (4.1.1), $\{X_t\}$ has spectral density

$$\begin{aligned} f(\lambda) &= \frac{\sigma^2}{2\pi(1 - \phi^2)} \left(1 + \sum_{h=1}^{\infty} \phi^h\left(e^{-ih\lambda} + e^{ih\lambda}\right)\right) \\ &= \frac{\sigma^2}{2\pi(1 - \phi^2)} \left(1 + \frac{\phi e^{i\lambda}}{1 - \phi e^{i\lambda}} + \frac{\phi e^{-i\lambda}}{1 - \phi e^{-i\lambda}}\right) \\ &= \frac{\sigma^2}{2\pi}\left(1 - 2\phi\cos\lambda + \phi^2\right)^{-1}. \end{aligned}$$

Graphs of $f(\lambda)$, $0 \le \lambda \le \pi$, are displayed in Figures 4-3 and 4-4 for $\phi = 0.7$ and $\phi = -0.7$. Observe that for $\phi = 0.7$ the density is large for low frequencies and small for high frequencies. This is not unexpected, since when $\phi = 0.7$ the process has a positive ACF with a large value at lag one (see Figure 4-5), making the series smooth with relatively few high-frequency components. On the other hand, for $\phi = -0.7$ the ACF has a large negative value at lag one (see Figure 4-6), producing a series that fluctuates rapidly about its mean value. In this case the series has a large contribution from high-frequency components as reflected by the size of the spectral density near frequency $\pi$.

□

Figure 4-3  The spectral density $f(\lambda)$, $0 \le \lambda \le \pi$, of $X_t = 0.7X_{t-1} + Z_t$, where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$

Figure 4-4  The spectral density $f(\lambda)$, $0 \le \lambda \le \pi$, of $X_t = -0.7X_{t-1} + Z_t$, where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$

Figure 4-5  The ACF of the AR(1) process $X_t = 0.7X_{t-1} + Z_t$

Figure 4-6  The ACF of the AR(1) process $X_t = -0.7X_{t-1} + Z_t$


Example 4.1.5

Spectral Density of an MA(1) Process
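If

$$X_t = Z_t + \theta Z_{t-1},$$

where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$, then from (4.1.1),

$$f(\lambda) = \frac{\sigma^2}{2\pi}\left(1 + \theta^2 + \theta\left(e^{-i\lambda} + e^{i\lambda}\right)\right) = \frac{\sigma^2}{2\pi}\left(1 + 2\theta\cos\lambda + \theta^2\right).$$

This function is shown in Figures 4-7 and 4-8 for the values $\theta = 0.9$ and $\theta = -0.9$. Interpretations of the graphs analogous to those in Example 4.1.4 can again be made.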

□ 
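The closed-form AR(1) and MA(1) spectral densities of Examples 4.1.4 and 4.1.5 can be evaluated directly, and the AR(1) form can be compared with a truncated version of the series (4.1.1). In the sketch below the function names, the default $\sigma^2 = 1$, and the truncation lag are assumptions for illustration.

```python
import math

def ar1_density(lam, phi, sigma2=1.0):
    """AR(1): sigma^2 / (2 pi (1 - 2 phi cos(lam) + phi^2)); requires |phi| < 1."""
    return sigma2 / (2 * math.pi * (1 - 2 * phi * math.cos(lam) + phi ** 2))

def ma1_density(lam, theta, sigma2=1.0):
    """MA(1): sigma^2 (1 + 2 theta cos(lam) + theta^2) / (2 pi)."""
    return sigma2 * (1 + 2 * theta * math.cos(lam) + theta ** 2) / (2 * math.pi)

def ar1_density_from_acvf(lam, phi, sigma2=1.0, max_lag=200):
    """Truncated (4.1.1) using gamma(h) = sigma^2 phi^|h| / (1 - phi^2)."""
    gamma0 = sigma2 / (1 - phi ** 2)
    series = 1 + 2 * sum(phi ** h * math.cos(h * lam)
                         for h in range(1, max_lag + 1))
    return gamma0 * series / (2 * math.pi)

# Low versus high frequency for the two AR(1) coefficients used in the text.
for phi in (0.7, -0.7):
    print(phi, ar1_density(0.0, phi), ar1_density(math.pi, phi))
```

For $\phi = 0.7$ the density is larger at frequency 0 than at $\pi$, and the ordering reverses for $\phi = -0.7$, in line with the discussion of Figures 4-3 and 4-4.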
Figure 4-7  The spectral density $f(\lambda)$, $0 \le \lambda \le \pi$, of $X_t = Z_t + 0.9Z_{t-1}$, where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$

Figure 4-8  The spectral density $f(\lambda)$, $0 \le \lambda \le \pi$, of $X_t = Z_t - 0.9Z_{t-1}$, where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$


4.2 The Periodogram


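If $\{X_t\}$ is a stationary time series with ACVF $\gamma(\cdot)$ and spectral density $f(\cdot)$, then just as the sample ACVF $\hat\gamma(\cdot)$ of the observations $\{x_1, \ldots, x_n\}$ can be regarded as a sample analogue of $\gamma(\cdot)$, so also can the periodogram $I_n(\cdot)$ of the observations be regarded as a sample analogue of $2\pi f(\cdot)$.

To introduce the periodogram, we consider the vector of complex numbers

$$\mathbf{x} = (x_1, x_2, \ldots, x_n)' \in \mathbb{C}^n,$$

where $\mathbb{C}^n$ denotes the set of all column vectors with complex-valued components. Now let $\omega_k = 2\pi k/n$, where $k$ is any integer between $-(n-1)/2$ and $n/2$ (inclusive), i.e.,

$$\omega_k = \frac{2\pi k}{n}, \quad k = -\left[\frac{n-1}{2}\right], \ldots, \left[\frac{n}{2}\right], \tag{4.2.1}$$

where $[y]$ denotes the largest integer less than or equal to $y$. We shall refer to the set $F_n$ of these values as the Fourier frequencies associated with sample size $n$, noting that $F_n$ is a subset of the interval $(-\pi, \pi]$. Correspondingly, we introduce the $n$ vectors

$$\mathbf{e}_k = \frac{1}{\sqrt{n}}\left(e^{i\omega_k}, e^{2i\omega_k}, \ldots, e^{ni\omega_k}\right)', \quad k = -\left[\frac{n-1}{2}\right], \ldots, \left[\frac{n}{2}\right]. \tag{4.2.2}$$

Now $\mathbf{e}_1, \ldots, \mathbf{e}_n$ are orthonormal in the sense that

$$\mathbf{e}_j^{*}\mathbf{e}_k = \begin{cases} 1, & \text{if } j = k, \\ 0, & \text{if } j \ne k, \end{cases} \tag{4.2.3}$$

where $\mathbf{e}_j^{*}$ denotes the row vector whose $k$th component is the complex conjugate of the $k$th component of $\mathbf{e}_j$ (see Problem 4.3). This implies that $\{\mathbf{e}_1, \ldots, \mathbf{e}_n\}$ is a basis for $\mathbb{C}^n$, so that any $\mathbf{x} \in \mathbb{C}^n$ can be expressed as the sum of $n$ components,

$$\mathbf{x} = \sum_{k=-[(n-1)/2]}^{[n/2]} a_k\mathbf{e}_k. \tag{4.2.4}$$

The coefficients $a_k$ are easily found by multiplying (4.2.4) on the left by $\mathbf{e}_k^{*}$ and using (4.2.3). Thus,

$$a_k = \mathbf{e}_k^{*}\mathbf{x} = \frac{1}{\sqrt{n}} \sum_{t=1}^{n} x_t e^{-it\omega_k}. \tag{4.2.5}$$

The sequence $\{a_k\}$ is called the discrete Fourier transform of the sequence $\{x_1, \ldots, x_n\}$.

Remark 1. Since the $t$th component of $\mathbf{e}_k$ is $n^{-1/2}e^{it\omega_k}$, the $t$th component of (4.2.4) can be written as

$$x_t = \frac{1}{\sqrt{n}} \sum_{k=-[(n-1)/2]}^{[n/2]} a_k\left[\cos(\omega_k t) + i\sin(\omega_k t)\right], \quad t = 1, \ldots, n, \tag{4.2.6}$$

showing that (4.2.4) is just a way of representing $x_t$ as a linear combination of sine waves with frequencies $\omega_k \in F_n$. □


Definition 4.2.1

The periodogram of $\{x_1, \ldots, x_n\}$ is the function

$$I_n(\lambda) = \frac{1}{n}\left|\sum_{t=1}^{n} x_t e^{-it\lambda}\right|^2. \tag{4.2.7}$$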


Remark 2. If $\lambda$ is one of the Fourier frequencies $\omega_k$, then $I_n(\omega_k) = |a_k|^2$, and so from (4.2.4) and (4.2.3) we find at once that the squared length of $\mathbf{x}$ is

$$\sum_{t=1}^{n} |x_t|^2 = \mathbf{x}^{*}\mathbf{x} = \sum_{k=-[(n-1)/2]}^{[n/2]} |a_k|^2 = \sum_{k=-[(n-1)/2]}^{[n/2]} I_n(\omega_k).$$

The value of the periodogram at frequency $\omega_k$ is thus the contribution to this sum of squares from the "frequency $\omega_k$" term $a_k\mathbf{e}_k$ in (4.2.4). □
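To make the discrete Fourier transform and the sum-of-squares identity in Remark 2 concrete, here is a minimal sketch. It assumes the normalization $a_k = n^{-1/2}\sum_t x_t e^{-it\omega_k}$ with $t$ running from 1 to $n$; the data and function names are made up for illustration.

```python
import cmath
import math

def fourier_indices(n):
    """Integers k with -[(n-1)/2] <= k <= [n/2]."""
    return range(-((n - 1) // 2), n // 2 + 1)

def dft(x):
    """Coefficients a_k = n^{-1/2} sum_t x_t exp(-i t omega_k), t = 1, ..., n."""
    n = len(x)
    return {
        k: sum(x[t - 1] * cmath.exp(-2j * math.pi * k * t / n)
               for t in range(1, n + 1)) / math.sqrt(n)
        for k in fourier_indices(n)
    }

def inverse_dft(a, n):
    """Rebuild x_t = n^{-1/2} sum_k a_k exp(i t omega_k), i.e. x = sum_k a_k e_k."""
    return [
        sum(a_k * cmath.exp(2j * math.pi * k * t / n) for k, a_k in a.items())
        / math.sqrt(n)
        for t in range(1, n + 1)
    ]

x = [2.0, -1.0, 0.5, 3.0, 1.5, -2.5, 0.0, 4.0]  # made-up data
a = dft(x)
rebuilt = inverse_dft(a, len(x))
sum_squares = sum(v * v for v in x)
sum_periodogram = sum(abs(a_k) ** 2 for a_k in a.values())  # sum of I_n(omega_k)
```

Up to rounding, `rebuilt` reproduces `x` and `sum_periodogram` equals `sum_squares`.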


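The next proposition shows that $I_n(\lambda)$ can be regarded as a sample analogue of $2\pi f(\lambda)$. Recall that if $\sum_{h=-\infty}^{\infty} |\gamma(h)| < \infty$, then

$$2\pi f(\lambda) = \sum_{h=-\infty}^{\infty} \gamma(h)e^{-ih\lambda}, \quad \lambda \in (-\pi, \pi]. \tag{4.2.8}$$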


Proposition 4.2.1 If $x_1, \ldots, x_n$ are any real numbers and $\omega_k$ is any of the nonzero Fourier frequencies $2\pi k/n$ in $(-\pi, \pi]$, then

$$I_n(\omega_k) = \sum_{|h|<n} \hat\gamma(h)e^{-ih\omega_k}, \tag{4.2.9}$$

where $\hat\gamma(h)$ is the sample ACVF of $x_1, \ldots, x_n$.


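Proof Since $\sum_{t=1}^{n} e^{-it\omega_k} = 0$ if $\omega_k \ne 0$, we can subtract the sample mean $\bar x$ from $x_t$ in the defining equation (4.2.7) of $I_n(\omega_k)$. Hence,

$$I_n(\omega_k) = n^{-1} \sum_{s=1}^{n}\sum_{t=1}^{n} (x_s - \bar x)(x_t - \bar x)e^{-i(s-t)\omega_k} = \sum_{|h|<n} \hat\gamma(h)e^{-ih\omega_k}. \quad ■$$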
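Identity (4.2.9) can be checked numerically on a small data set. The sketch below assumes the usual sample ACVF with divisor $n$ and mean correction; the data are made up, and the function names are illustrative.

```python
import cmath
import math

def periodogram(x, lam):
    """I_n(lambda) = n^{-1} |sum_t x_t exp(-i t lambda)|^2, t = 1, ..., n."""
    n = len(x)
    s = sum(x[t - 1] * cmath.exp(-1j * t * lam) for t in range(1, n + 1))
    return abs(s) ** 2 / n

def sample_acvf(x, h):
    """Sample ACVF with divisor n, computed about the sample mean."""
    n = len(x)
    mean = sum(x) / n
    h = abs(h)
    return sum((x[t + h] - mean) * (x[t] - mean) for t in range(n - h)) / n

def acvf_side(x, lam):
    """Right-hand side of (4.2.9): sum over |h| < n of gamma_hat(h) exp(-i h lambda)."""
    n = len(x)
    total = sum(sample_acvf(x, h) * cmath.exp(-1j * h * lam)
                for h in range(-(n - 1), n))
    return total.real

x = [2.0, -1.0, 0.5, 3.0, 1.5, -2.5, 0.0]  # made-up data with nonzero mean
n = len(x)
gaps = [abs(periodogram(x, 2 * math.pi * k / n) - acvf_side(x, 2 * math.pi * k / n))
        for k in (1, 2, 3)]
```

Each entry of `gaps` is zero up to rounding, even though the periodogram itself is computed without mean correction.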

In view of the similarity between (4.2.8) and (4.2.9), a natural estimate of the spectral density $f(\lambda)$ is $I_n(\lambda)/(2\pi)$. For a very large class of stationary time series $\{X_t\}$ with strictly positive spectral density, it can be shown that for any fixed frequencies $\lambda_1, \ldots, \lambda_m$ such that $0 < \lambda_1 < \cdots < \lambda_m < \pi$, the joint distribution function $F_n(x_1, \ldots, x_m)$ of the periodogram values $(I_n(\lambda_1), \ldots, I_n(\lambda_m))$ converges, as $n \to \infty$, to $F(x_1, \ldots, x_m)$, where

$$F(x_1, \ldots, x_m) = \begin{cases} \displaystyle\prod_{i=1}^{m}\left(1 - \exp\left\{\frac{-x_i}{2\pi f(\lambda_i)}\right\}\right), & \text{if } x_1, \ldots, x_m > 0, \\ 0, & \text{otherwise}. \end{cases} \tag{4.2.10}$$

Thus for large $n$ the periodogram ordinates $(I_n(\lambda_1), \ldots, I_n(\lambda_m))$ are approximately distributed as independent exponential random variables with means $2\pi f(\lambda_1), \ldots, 2\pi f(\lambda_m)$, respectively. In particular, for each fixed $\lambda \in (0, \pi)$ and $\epsilon > 0$,

$$P\left[|I_n(\lambda) - 2\pi f(\lambda)| > \epsilon\right] \to p > 0, \quad \text{as } n \to \infty,$$

so the probability of an estimation error larger than $\epsilon$ cannot be made arbitrarily small by choosing a sufficiently large sample size $n$. Thus, $I_n(\lambda)$ is not a consistent estimator of $2\pi f(\lambda)$.
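The lack of consistency can be seen in a small simulation. The sketch below draws Gaussian white noise with $\sigma^2 = 1$ (so that $2\pi f(\lambda) = 1$) and records the periodogram ordinate at $\lambda = \pi/2$ for two sample sizes. The seed, sample sizes, number of replications, and the Gaussian assumption are all choices made for this illustration.

```python
import cmath
import math
import random

def periodogram(x, lam):
    """I_n(lambda) = n^{-1} |sum_t x_t exp(-i t lambda)|^2, t = 1, ..., n."""
    n = len(x)
    s = sum(x[t - 1] * cmath.exp(-1j * t * lam) for t in range(1, n + 1))
    return abs(s) ** 2 / n

def simulate_ordinates(n, reps, rng):
    """Periodogram at pi/2 (a Fourier frequency when n is a multiple of 4)."""
    lam = math.pi / 2
    return [periodogram([rng.gauss(0.0, 1.0) for _ in range(n)], lam)
            for _ in range(reps)]

def mean_and_variance(values):
    m = sum(values) / len(values)
    v = sum((u - m) ** 2 for u in values) / (len(values) - 1)
    return m, v

rng = random.Random(12345)
summary = {n: mean_and_variance(simulate_ordinates(n, 2000, rng))
           for n in (32, 256)}
```

In both cases the sample mean and sample variance of the ordinates stay near 1, as expected for an exponential variable with mean $2\pi f(\lambda) = 1$: increasing $n$ does not shrink the variance.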

Since for large $n$ the periodogram ordinates at fixed frequencies are approximately independent with variances changing only slightly over small frequency intervals, we might hope to construct a consistent estimator of $f(\lambda)$ by averaging the periodogram estimates in a small frequency interval containing $\lambda$, provided that we can choose the interval in such a way that its width decreases to zero while at the same time the number of Fourier frequencies in the interval increases to $\infty$ as $n \to \infty$. This can indeed be done, since the number of Fourier frequencies in any fixed frequency interval increases approximately linearly with $n$. Consider, for example, the estimator

$$\tilde f(\lambda) = \frac{1}{2\pi} \sum_{|j| \le m} (2m + 1)^{-1} I_n\left(g(n, \lambda) + 2\pi j/n\right), \tag{4.2.11}$$

where $m = \sqrt{n}$ and $g(n, \lambda)$ is the multiple of $2\pi/n$ closest to $\lambda$. The number of periodogram ordinates being averaged is approximately $2\sqrt{n}$, and the width of the frequency interval over which the average is taken is approximately $4\pi/\sqrt{n}$. It can be shown (see Brockwell and Davis (1991), Section 10.4) that this estimator is consistent for the spectral density $f$. The argument in fact establishes the consistency of a whole class of estimators defined as follows.


Definition  4.2.2 


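A discrete spectral average estimator of the spectral density $f(\lambda)$ has the form

$$\hat f(\lambda) = \frac{1}{2\pi} \sum_{|j| \le m_n} W_n(j) I_n\left(g(n, \lambda) + 2\pi j/n\right), \tag{4.2.12}$$

where the bandwidths $m_n$ satisfy

$$m_n \to \infty \text{ and } m_n/n \to 0 \text{ as } n \to \infty, \tag{4.2.13}$$

and the weight functions $W_n(\cdot)$ satisfy

$$W_n(j) = W_n(-j), \quad W_n(j) \ge 0 \text{ for all } j, \tag{4.2.14}$$

$$\sum_{|j| \le m_n} W_n(j) = 1, \tag{4.2.15}$$

and

$$\sum_{|j| \le m_n} W_n^2(j) \to 0 \text{ as } n \to \infty. \tag{4.2.16}$$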
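A discrete spectral average of the form (4.2.12) can be sketched as follows. The code assumes equal (Daniell-type) weights $W(j) = 1/(2m+1)$ and relies on the periodogram being defined at every frequency; the data, the function names, and the choice $\lambda = 1.6$ are illustrative. In practice the ordinate at frequency 0 is dominated by the sample mean, so data are usually mean-corrected first; that step is omitted here.

```python
import cmath
import math

def periodogram(x, lam):
    """I_n(lambda) = n^{-1} |sum_t x_t exp(-i t lambda)|^2, t = 1, ..., n."""
    n = len(x)
    s = sum(x[t - 1] * cmath.exp(-1j * t * lam) for t in range(1, n + 1))
    return abs(s) ** 2 / n

def nearest_fourier_frequency(n, lam):
    """g(n, lambda): the multiple of 2*pi/n closest to lambda."""
    return 2 * math.pi * round(lam * n / (2 * math.pi)) / n

def daniell_weights(m):
    """Equal weights W(j) = 1/(2m+1) for |j| <= m, as a dict {j: W(j)}."""
    return {j: 1.0 / (2 * m + 1) for j in range(-m, m + 1)}

def spectral_average(x, lam, weights):
    """Weighted average of periodogram ordinates around g(n, lambda), over 2*pi."""
    n = len(x)
    g = nearest_fourier_frequency(n, lam)
    return sum(w * periodogram(x, g + 2 * math.pi * j / n)
               for j, w in weights.items()) / (2 * math.pi)

x = [2.0, -1.0, 0.5, 3.0, 1.5, -2.5, 0.0, 4.0]  # made-up data, n = 8
estimate = spectral_average(x, 1.6, daniell_weights(1))   # averages pi/4, pi/2, 3pi/4
raw = spectral_average(x, 1.6, daniell_weights(0))        # unsmoothed I_n(pi/2)/(2 pi)
```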


Remark 3. The conditions imposed on the sequences $\{m_n\}$ and $\{W_n(\cdot)\}$ ensure consistency of $\hat f(\lambda)$ for $f(\lambda)$ for a very large class of stationary processes (see Brockwell and Davis (1991), Theorem 10.4.1) including all the ARMA processes considered in this book. The conditions (4.2.13) simply mean that the number of terms in the weighted average (4.2.12) goes to $\infty$ as $n \to \infty$ while at the same time the width of the frequency interval over which the average is taken goes to zero. The conditions on $\{W_n(\cdot)\}$ ensure that the mean and variance of $\hat f(\lambda)$ converge as $n \to \infty$ to $f(\lambda)$ and 0, respectively. Under the conditions of Brockwell and Davis (1991), Theorem 10.4.1, it can be shown, in fact, that

$$\lim_{n\to\infty} E\hat f(\lambda) = f(\lambda)$$

and

$$\lim_{n\to\infty} \left(\sum_{|j| \le m_n} W_n^2(j)\right)^{-1} \mathrm{Cov}\left(\hat f(\lambda), \hat f(\nu)\right) = \begin{cases} 2f^2(\lambda) & \text{if } \lambda = \nu = 0 \text{ or } \pi, \\ f^2(\lambda) & \text{if } 0 < \lambda = \nu < \pi, \\ 0 & \text{if } \lambda \ne \nu. \end{cases}$$
□

Figure 4-9  The spectral density estimate, $I_{100}(\lambda)/(2\pi)$, $0 < \lambda \le \pi$, of the sunspot numbers, 1770-1869


Example 4.2.1

For the simple moving average estimator with $m_n = \sqrt{n}$ and $W_n(j) = (2m_n + 1)^{-1}$, $|j| \le m_n$, Remark 3 gives

$$(1 + 2\sqrt{n})\,\mathrm{Var}\left(\hat f(\lambda)\right) \to \begin{cases} 2f^2(\lambda) & \text{if } \lambda = 0 \text{ or } \pi, \\ f^2(\lambda) & \text{if } 0 < \lambda < \pi. \end{cases}$$

□


In practice, when the sample size $n$ is a fixed finite number, the choice of $m$ and $\{W(\cdot)\}$ involves a compromise between achieving small bias and small variance for the estimator $\hat f(\lambda)$. A weight function that assigns roughly equal weights to a broad band of frequencies will produce an estimate of $f(\lambda)$ that, although smooth, may have a large bias, since the estimate of $f(\lambda)$ depends on the values of $I_n$ at frequencies distant from $\lambda$. On the other hand, a weight function that assigns most of its weight to a narrow frequency band centered at zero will give an estimator with relatively small bias, but with a larger variance. In practice it is advisable to experiment with a range of weight functions and to select the one that appears to strike a satisfactory balance between bias and variance.

The option Spectrum>Smoothed Periodogram in the program ITSM allows the user to apply up to 50 successive discrete spectral average filters with weights $W(j) = 1/(2m + 1)$, $j = -m, -m + 1, \ldots, m$, to the periodogram. The value of $m$ for each filter can be specified arbitrarily, and the weights of the filter corresponding to the combined effect (the convolution of the component filters) are displayed by the program. The program computes the corresponding discrete spectral average estimators $\hat f(\lambda)$, $0 \le \lambda \le \pi$.


Example 4.2.2

The Sunspot Numbers, 1770-1869

Figure 4-9 displays a plot of $(2\pi)^{-1}$ times the periodogram of the annual sunspot numbers (obtained by opening the project SUNSPOTS.TSM in ITSM and selecting Spectrum>Periodogram). Figure 4-10 shows the result of applying the discrete spectral weights $\{\frac{1}{3}, \frac{1}{3}, \frac{1}{3}\}$ (corresponding to $m = 1$, $W(j) = 1/(2m + 1)$, $|j| \le m$). It is obtained from ITSM by selecting Spectrum>Smoothed Periodogram, entering 1 for the number of Daniell filters, 1 for the order $m$, and clicking on Apply. As expected, with such a small value of $m$, not much smoothing of the periodogram occurs. If we change the number of Daniell filters to 2 and set the order of the first filter to 1 and the order of the second filter to 2, we obtain a combined filter with a more dispersed set of weights, $W(0) = W(1) = \frac{3}{15}$, $W(2) = \frac{2}{15}$, $W(3) = \frac{1}{15}$. Clicking on Apply will then give the smoother spectral estimate shown in Figure 4-11. When you are satisfied with the smoothed estimate click OK, and the dialog box will close. All three spectral density estimates show a well-defined peak at the frequency $\omega_{10} = 2\pi/10$ radians per year, in keeping with the suggestion from the graph of the data itself that the sunspot series contains an approximate cycle with period around 10 or 11 years.

□

Figure 4-11  The spectral density estimate, $\hat f(\lambda)$, $0 < \lambda \le \pi$, of the sunspot numbers, 1770-1869, with weights $\{\frac{1}{15}, \frac{2}{15}, \frac{3}{15}, \frac{3}{15}, \frac{3}{15}, \frac{2}{15}, \frac{1}{15}\}$
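The combined weights quoted in Example 4.2.2 are the convolution of two equal-weight filters of orders 1 and 2. The sketch below reproduces them with exact rational arithmetic; it illustrates the convolution only and makes no claim about how ITSM computes the weights internally.

```python
from fractions import Fraction

def daniell(m):
    """Equal weights 1/(2m+1) for j = -m, ..., m, listed from j = -m upward."""
    return [Fraction(1, 2 * m + 1)] * (2 * m + 1)

def convolve(u, v):
    """Discrete convolution of two finite weight sequences."""
    out = [Fraction(0)] * (len(u) + len(v) - 1)
    for i, a in enumerate(u):
        for j, b in enumerate(v):
            out[i + j] += a * b
    return out

# Orders 1 and 2, as in the example; the result is indexed j = -3, ..., 3.
combined = convolve(daniell(1), daniell(2))
```

The entries of `combined` are 1/15, 2/15, 3/15, 3/15, 3/15, 2/15, 1/15, and they sum to 1.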


4.3  Time-Invariant  Linear  Filters 

In Section 1.5 we saw the utility of time-invariant linear filters for smoothing the data, estimating the trend, eliminating the seasonal and/or trend components of the data, etc. A linear process is the output of a time-invariant linear filter (TLF) applied to a white noise input series. More generally, we say that the process $\{Y_t\}$ is the output of a linear filter $C = \{c_{t,k},\ t, k = 0, \pm 1, \ldots\}$ applied to an input process $\{X_t\}$ if

$$Y_t = \sum_{k=-\infty}^{\infty} c_{t,k}X_k, \quad t = 0, \pm 1, \ldots. \tag{4.3.1}$$

The filter is said to be time-invariant if the weights $c_{t,t-k}$ are independent of $t$, i.e., if

$$c_{t,t-k} = \psi_k.$$

In this case,

$$Y_t = \sum_{k=-\infty}^{\infty} \psi_k X_{t-k}$$

and

$$Y_{t-s} = \sum_{k=-\infty}^{\infty} \psi_k X_{t-s-k},$$

so that the time-shifted process $\{Y_{t-s}, t = 0, \pm 1, \ldots\}$ is obtained from $\{X_{t-s}, t = 0, \pm 1, \ldots\}$ by application of the same linear filter $\Psi = \{\psi_j, j = 0, \pm 1, \ldots\}$. The TLF $\Psi$ is said to be causal if

$$\psi_j = 0 \text{ for } j < 0,$$

since then $Y_t$ is expressible in terms only of $X_s$, $s \le t$.


Example 4.3.1

The filter defined by

$$Y_t = aX_{-t}, \quad t = 0, \pm 1, \ldots,$$

is linear but not time-invariant, since $c_{t,t-k} = 0$ except when $2t = k$. Thus, $c_{t,t-k}$ depends on the value of $t$.

□


Example 4.3.2

The Simple Moving Average

The filter

$$Y_t = (2q + 1)^{-1} \sum_{|j| \le q} X_{t-j}$$

is a TLF with $\psi_j = (2q + 1)^{-1}$, $j = -q, \ldots, q$, and $\psi_j = 0$ otherwise.

□

Spectral methods are particularly valuable in describing the behavior of time-invariant linear filters as well as in designing filters for particular purposes such as the suppression of high-frequency components. The following proposition shows how the spectral density of the output of a TLF is related to the spectral density of the input, a fundamental result in the study of time-invariant linear filters.


Proposition 4.3.1

Let $\{X_t\}$ be a stationary time series with mean zero and spectral density $f_X(\lambda)$. Suppose that $\Psi = \{\psi_j, j = 0, \pm 1, \ldots\}$ is an absolutely summable TLF (i.e., $\sum_{j=-\infty}^{\infty} |\psi_j| < \infty$). Then the time series

$$Y_t = \sum_{j=-\infty}^{\infty} \psi_j X_{t-j}$$

is stationary with mean zero and spectral density

$$f_Y(\lambda) = \left|\Psi\left(e^{-i\lambda}\right)\right|^2 f_X(\lambda) = \Psi\left(e^{-i\lambda}\right)\Psi\left(e^{i\lambda}\right) f_X(\lambda),$$

where $\Psi(e^{-i\lambda}) = \sum_{j=-\infty}^{\infty} \psi_j e^{-ij\lambda}$. (The function $\Psi(e^{-i\cdot})$ is called the transfer function of the filter, and the squared modulus $|\Psi(e^{-i\cdot})|^2$ is referred to as the power transfer function of the filter.)


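Proof  Applying Proposition 2.2.1, we see that $\{Y_t\}$ is stationary with mean 0 and ACVF

$$\gamma_Y(h) = \sum_{j,k=-\infty}^{\infty} \psi_j \psi_k \gamma_X(h + k - j). \tag{4.3.2}$$

Since $\{X_t\}$ has spectral density $f_X(\lambda)$, we have

$$\gamma_X(h + k - j) = \int_{-\pi}^{\pi} e^{i(h-j+k)\lambda} f_X(\lambda)\, d\lambda, \tag{4.3.3}$$

which, when substituted into (4.3.2), gives

$$\begin{aligned}
\gamma_Y(h) &= \sum_{j,k=-\infty}^{\infty} \psi_j \psi_k \int_{-\pi}^{\pi} e^{i(h-j+k)\lambda} f_X(\lambda)\, d\lambda \\
&= \int_{-\pi}^{\pi} \left(\sum_{j=-\infty}^{\infty} \psi_j e^{-ij\lambda}\right) \left(\sum_{k=-\infty}^{\infty} \psi_k e^{ik\lambda}\right) e^{ih\lambda} f_X(\lambda)\, d\lambda \\
&= \int_{-\pi}^{\pi} e^{ih\lambda} \left|\sum_{j=-\infty}^{\infty} \psi_j e^{-ij\lambda}\right|^2 f_X(\lambda)\, d\lambda.
\end{aligned}$$

The last expression immediately identifies the spectral density function of $\{Y_t\}$ as

$$f_Y(\lambda) = \left|\Psi(e^{-i\lambda})\right|^2 f_X(\lambda) = \Psi(e^{-i\lambda})\Psi(e^{i\lambda}) f_X(\lambda).$$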


Remark 4. Proposition 4.3.1 allows us to analyze the net effect of applying one or more filters in succession. For example, if the input process $\{X_t\}$ with spectral density $f_X$ is operated on sequentially by two absolutely summable TLFs $\Psi_1$ and $\Psi_2$, then the net effect is the same as that of a TLF with transfer function $\Psi_1(e^{-i\lambda})\Psi_2(e^{-i\lambda})$ and the spectral density of the output process

$$W_t = \Psi_1(B)\Psi_2(B)X_t$$

is $\left|\Psi_1(e^{-i\lambda})\Psi_2(e^{-i\lambda})\right|^2 f_X(\lambda)$. (See also Remark 2 of Section 2.2.) □


As we saw in Section 1.5, differencing at lag $s$ is one method for removing a seasonal component with period $s$ from a time series. The transfer function for this filter is $1 - e^{-is\lambda}$, which is zero for all frequencies that are integer multiples of $2\pi/s$ radians per unit time. Consequently, this filter has the desired effect of removing all components with period $s$.
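The zeros of the lag-$s$ differencing filter can be checked numerically. This is a small sketch, assuming NumPy; the helper name `lag_difference_power_transfer` and the choice $s = 12$ are illustrative only.

```python
import numpy as np

def lag_difference_power_transfer(lam, s):
    """|1 - e^{-i s lam}|^2, the power transfer function of differencing at lag s."""
    return np.abs(1.0 - np.exp(-1j * s * np.asarray(lam, dtype=float))) ** 2

s = 12
# Integer multiples of 2*pi/s that lie in [0, pi].
seasonal_freqs = 2 * np.pi * np.arange(0, s // 2 + 1) / s
gain_at_seasonal = lag_difference_power_transfer(seasonal_freqs, s)

# Halfway between two zeros the gain is |1 - e^{-i pi}|^2 = 4.
gain_midway = lag_difference_power_transfer(np.pi / s, s)
```

The filter therefore does more than remove the seasonal frequencies: components midway between them are amplified by a factor of 4 in power.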

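The simple moving-average filter in Example 4.3.2 has transfer function

$$\Psi(e^{-i\lambda}) = D_q(\lambda),$$

where $D_q(\lambda)$ is the Dirichlet kernel

$$D_q(\lambda) = \begin{cases} \dfrac{\sin[(q + 0.5)\lambda]}{(2q+1)\sin(\lambda/2)}, & \text{if } \lambda \ne 0, \\[2mm] 1, & \text{if } \lambda = 0. \end{cases}$$

[Figure 4-12: The transfer function $D_q(\lambda)$ for the simple moving-average filter, plotted against frequency.]

A graph of $D_q$ is given in Figure 4-12. Notice that $|D_q(\lambda)|$ is near 1 in a neighborhood of 0 and tapers off to 0 for large frequencies. This is an example of a low-pass filter. The ideal low-pass filter would have a transfer function of the form

$$\Psi(e^{-i\lambda}) = \begin{cases} 1, & \text{if } |\lambda| \le \omega_c, \\ 0, & \text{if } |\lambda| > \omega_c, \end{cases}$$

where $\omega_c$ is a predetermined cutoff value. To determine the corresponding linear filter, we expand $\Psi(e^{-i\lambda})$ as a Fourier series,

$$\Psi(e^{-i\lambda}) = \sum_{j=-\infty}^{\infty} \psi_j e^{-ij\lambda}, \tag{4.3.4}$$

with coefficients

$$\psi_j = \frac{1}{2\pi} \int_{-\omega_c}^{\omega_c} e^{ij\lambda}\, d\lambda = \begin{cases} \dfrac{\omega_c}{\pi}, & \text{if } j = 0, \\[2mm] \dfrac{\sin(j\omega_c)}{j\pi}, & \text{if } |j| > 0. \end{cases}$$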


We can approximate the ideal low-pass filter by truncating the series in (4.3.4) at some large value $q$, which may depend on the length of the observed input series. In Figure 4-13 the transfer function of the ideal low-pass filter with $\omega_c = \pi/4$ is plotted with the approximations $\Psi^{(q)}(e^{-i\lambda}) = \sum_{j=-q}^{q} \psi_j e^{-ij\lambda}$ for $q = 2$ and $q = 10$. As can be seen in the figure, the approximations do not mirror $\Psi$ very well near the cutoff value $\omega_c$ and behave like damped sinusoids for frequencies greater than $\omega_c$. The poor approximation in the neighborhood of $\omega_c$ is typical of Fourier series approximations to functions with discontinuities, an effect known as the Gibbs phenomenon. Convergence factors may be employed to help mitigate the overshoot problem at $\omega_c$ and to improve the overall approximation of $\Psi^{(q)}(e^{-i\cdot})$ to $\Psi(e^{-i\cdot})$ (see Bloomfield 2000).

[Figure 4-13: The transfer function for the ideal low-pass filter and truncated Fourier approximations $\Psi^{(q)}$ for $q = 2, 10$, plotted against frequency.]
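The truncated approximations described above can be reproduced with a few lines of code. This is a sketch under stated assumptions: NumPy is available, and the function names `ideal_lowpass_coefficients` and `truncated_transfer` are hypothetical helpers, not ITSM features.

```python
import numpy as np

def ideal_lowpass_coefficients(omega_c, q):
    """Fourier coefficients psi_j, j = -q, ..., q, of the ideal low-pass filter."""
    j = np.arange(-q, q + 1)
    psi = np.empty(2 * q + 1)
    nonzero = j != 0
    psi[nonzero] = np.sin(j[nonzero] * omega_c) / (j[nonzero] * np.pi)
    psi[~nonzero] = omega_c / np.pi
    return j, psi

def truncated_transfer(lam, omega_c, q):
    """Psi^(q)(e^{-i lam}) = sum_{|j|<=q} psi_j e^{-i j lam}.

    The result is real because psi_j = psi_{-j}.
    """
    j, psi = ideal_lowpass_coefficients(omega_c, q)
    lam = np.atleast_1d(np.asarray(lam, dtype=float))
    return (psi[None, :] * np.exp(-1j * np.outer(lam, j))).sum(axis=1).real

lam = np.linspace(0.0, np.pi, 2001)
omega_c = np.pi / 4
approx_q2 = truncated_transfer(lam, omega_c, q=2)
approx_q10 = truncated_transfer(lam, omega_c, q=10)

# The q = 10 approximation exceeds 1 somewhere in the pass band, and at the
# cutoff it sits near the midpoint 1/2 of the jump.
overshoot_q10 = approx_q10.max() - 1.0
value_at_cutoff = truncated_transfer(omega_c, omega_c, q=10)[0]
```

Plotting `approx_q2` and `approx_q10` against `lam` should give curves of the kind described for Figure 4-13, with ripples on both sides of $\omega_c$.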


4.4  The  Spectral  Density  of  an  ARMA  Process 

In  Section  4.1  the  spectral  density  was  computed  for  an  MA(1)  and  for  an  AR(1) 
process.  As  an  application  of  Proposition  4.3.1,  we  can  now  easily  derive  the  spectral 
density  of  an  arbitrary  ARMA (p,  q )  process. 


Spectral Density of an ARMA($p, q$) Process: If $\{X_t\}$ is a causal ARMA($p, q$) process satisfying $\phi(B)X_t = \theta(B)Z_t$, then

$$f_X(\lambda) = \frac{\sigma^2}{2\pi} \frac{\left|\theta(e^{-i\lambda})\right|^2}{\left|\phi(e^{-i\lambda})\right|^2}, \quad -\pi \le \lambda \le \pi. \tag{4.4.1}$$

Because the spectral density of an ARMA process is a ratio of trigonometric polynomials, it is often called a rational spectral density.

Proof  From (3.1.3), $\{X_t\}$ is obtained from $\{Z_t\}$ by application of the TLF with transfer function

$$\Psi(e^{-i\lambda}) = \frac{\theta(e^{-i\lambda})}{\phi(e^{-i\lambda})}.$$

Since $\{Z_t\}$ has spectral density $f_Z(\lambda) = \sigma^2/(2\pi)$, the result now follows from Proposition 4.3.1. ■
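Formula (4.4.1) is straightforward to evaluate numerically. The sketch below assumes NumPy; the function name `arma_spectral_density` and the coefficient values in the usage lines are hypothetical, chosen only for illustration.

```python
import numpy as np

def arma_spectral_density(lam, phi=(), theta=(), sigma2=1.0):
    """Evaluate (4.4.1): sigma^2 |theta(e^{-i lam})|^2 / (2 pi |phi(e^{-i lam})|^2).

    phi   -- AR coefficients (phi_1, ..., phi_p), phi(z) = 1 - phi_1 z - ... - phi_p z^p
    theta -- MA coefficients (theta_1, ..., theta_q), theta(z) = 1 + theta_1 z + ... + theta_q z^q
    """
    lam = np.atleast_1d(np.asarray(lam, dtype=float))
    z = np.exp(-1j * lam)
    ar_poly = np.ones_like(z)
    for k, coef in enumerate(phi, start=1):
        ar_poly -= coef * z ** k
    ma_poly = np.ones_like(z)
    for k, coef in enumerate(theta, start=1):
        ma_poly += coef * z ** k
    return sigma2 * np.abs(ma_poly) ** 2 / (2 * np.pi * np.abs(ar_poly) ** 2)

lam = np.linspace(-np.pi, np.pi, 7)
white_noise = arma_spectral_density(lam, sigma2=2.0)          # flat at sigma^2 / (2 pi)
arma11 = arma_spectral_density(lam, phi=[0.5], theta=[0.4])   # hypothetical coefficients
```

The function does not check causality; it simply evaluates the ratio, so it is the caller's responsibility to pass coefficients of a causal model.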


For any specified values of the parameters $\phi_1, \ldots, \phi_p$, $\theta_1, \ldots, \theta_q$ and $\sigma^2$, the Spectrum>Model option of ITSM can be used to plot the model spectral density.

[Figure 4-14: The spectral density $f_X(\lambda)$, $0 \le \lambda \le \pi$, of the AR(2) model (3.2.20) fitted to the mean-corrected sunspot series.]

Example 4.4.1  The Spectral Density of an AR(2) Process

For an AR(2) process (4.4.1) becomes

$$\begin{aligned}
f_X(\lambda) &= \frac{\sigma^2}{2\pi\left(1 - \phi_1 e^{-i\lambda} - \phi_2 e^{-2i\lambda}\right)\left(1 - \phi_1 e^{i\lambda} - \phi_2 e^{2i\lambda}\right)} \\
&= \frac{\sigma^2}{2\pi\left(1 + \phi_1^2 + 2\phi_2 + \phi_2^2 + 2(\phi_1\phi_2 - \phi_1)\cos\lambda - 4\phi_2\cos^2\lambda\right)}.
\end{aligned}$$


Figure 4-14 shows the spectral density, found from the Spectrum>Model option of ITSM, for the model (3.2.20) fitted to the mean-corrected sunspot series. Notice the well-defined peak in the model spectral density. The frequency at which this peak occurs can be found by differentiating the denominator of the spectral density with respect to $\cos\lambda$ and setting the derivative equal to zero. This gives

$$\cos\lambda = \frac{\phi_1\phi_2 - \phi_1}{4\phi_2} = 0.849.$$

The corresponding frequency is $\lambda = 0.556$ radians per year, or equivalently $c = \lambda/(2\pi) = 0.0885$ cycles per year, and the corresponding period is therefore $1/0.0885 = 11.3$ years. The model thus reflects the approximate cyclic behavior of the data already pointed out in Example 4.2.2. The model spectral density in Figure 4-14 should be compared with the rescaled periodogram of the data and the nonparametric spectral density estimates of Figures 4-9, 4-10, and 4-11.

□ 
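The peak calculation can be retraced in code. The coefficients of model (3.2.20) are not restated in this section, so the values `phi1 = 1.318` and `phi2 = -0.634` below are an assumption: they are illustrative values chosen because they reproduce the quoted figure $\cos\lambda = 0.849$. The sketch assumes NumPy.

```python
import numpy as np

# Assumed AR(2) coefficients (not given in this section); they reproduce
# the value cos(lambda) = 0.849 quoted in the text.
phi1, phi2 = 1.318, -0.634

cos_peak = (phi1 * phi2 - phi1) / (4 * phi2)
lam_peak = np.arccos(cos_peak)            # radians per year
cycles_per_year = lam_peak / (2 * np.pi)
period_years = 1 / cycles_per_year

# Cross-check: the denominator of the AR(2) spectral density should be
# smallest (so the density largest) at lam_peak on a fine grid.
grid = np.linspace(0.0, np.pi, 100001)
denominator = (1 + phi1**2 + 2 * phi2 + phi2**2
               + 2 * (phi1 * phi2 - phi1) * np.cos(grid)
               - 4 * phi2 * np.cos(grid) ** 2)
lam_grid_peak = grid[np.argmin(denominator)]
```

The stationary point is a minimum of the denominator because $\phi_2 < 0$ makes the coefficient $-4\phi_2$ of $\cos^2\lambda$ positive; with $\phi_2 > 0$ the same formula would locate a trough of the density instead.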


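Example 4.4.2  The ARMA(1,1) Process

In this case the expression (4.4.1) becomes

$$f_X(\lambda) = \frac{\sigma^2\left(1 + \theta e^{i\lambda}\right)\left(1 + \theta e^{-i\lambda}\right)}{2\pi\left(1 - \phi e^{i\lambda}\right)\left(1 - \phi e^{-i\lambda}\right)} = \frac{\sigma^2\left(1 + \theta^2 + 2\theta\cos\lambda\right)}{2\pi\left(1 + \phi^2 - 2\phi\cos\lambda\right)}.$$

□

4.4.1  Rational Spectral Density Estimation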

An  alternative  to  the  spectral  density  estimator  of  Definition  4.2.2  is  the  estimator 
obtained  by  fitting  an  ARMA  model  to  the  data  and  then  computing  the  spectral  density 
of  the  fitted  model.  The  spectral  density  shown  in  Figure  4-14  can  be  regarded  as  such 
an  estimate,  obtained  by  fitting  an  AR(2)  model  to  the  mean-corrected  sunspot  data. 

Provided that there is an ARMA model that fits the data satisfactorily, this procedure has the advantage that it can be made systematic by selecting the model according (for example) to the AICC criterion (see Section 5.5.2). For further information see Brockwell and Davis (1991), Section 10.6.


Problems 


4.1  Show that

$$\int_{-\pi}^{\pi} e^{i(k-h)\lambda}\, d\lambda = \begin{cases} 2\pi, & \text{if } k = h, \\ 0, & \text{otherwise.} \end{cases}$$


4.2  If $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$, apply Corollary 4.1.1 to compute the spectral density of


4.3  Show that the vectors $e_1, \ldots, e_n$ are orthonormal in the sense of (4.2.3).

4.4  Use Corollary 4.1.1 to establish whether or not the following function is the autocovariance function of a stationary process $\{X_t\}$:

$$\gamma(h) = \begin{cases} 1, & \text{if } h = 0, \\ -0.5, & \text{if } h = \pm 2, \\ -0.25, & \text{if } h = \pm 3, \\ 0, & \text{otherwise.} \end{cases}$$


4.5  If $\{X_t\}$ and $\{Y_t\}$ are uncorrelated stationary processes with autocovariance functions $\gamma_X(\cdot)$ and $\gamma_Y(\cdot)$ and spectral distribution functions $F_X(\cdot)$ and $F_Y(\cdot)$, respectively, show that the process $\{Z_t = X_t + Y_t\}$ is stationary with autocovariance function $\gamma_Z = \gamma_X + \gamma_Y$ and spectral distribution function $F_Z = F_X + F_Y$.


4.6  Let $\{X_t\}$ be the process defined by

$$X_t = A\cos(\pi t/3) + B\sin(\pi t/3) + Y_t,$$

where $Y_t = Z_t + 2.5Z_{t-1}$, $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$, $A$ and $B$ are uncorrelated with mean 0 and variance $\nu^2$, and $Z_t$ is uncorrelated with $A$ and $B$ for each $t$. Find the autocovariance function and spectral distribution function of $\{X_t\}$.


4.7  Let $\{X_t\}$ denote the sunspot series filed as SUNSPOTS.TSM and let $\{Y_t\}$ denote the mean-corrected series $Y_t = X_t - 46.93$, $t = 1, \ldots, 100$. Use ITSM to find the Yule-Walker AR(2) model

$$Y_t = \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2),$$

i.e., find $\phi_1$, $\phi_2$, and $\sigma^2$. Use ITSM to plot the spectral density of the fitted model and find the frequency at which it achieves its maximum value. What is the corresponding period?

4.8  (a) Use ITSM to compute and plot the spectral density of the stationary series $\{X_t\}$ satisfying

$$X_t - 0.99X_{t-3} = Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, 1).$$

(b) Does the spectral density suggest that the sample paths of $\{X_t\}$ will exhibit approximately oscillatory behavior? If so, then with what period?

(c) Use ITSM to simulate a realization of $X_1, \ldots, X_{60}$ and plot the realization. Does the graph of the realization support the conclusion of part (b)? Save the generated series as X.TSM by clicking on the window displaying the graph, then on the red EXP button near the top of the screen. Select Time Series and File in the resulting dialog box and click OK. You will then be asked to provide the file name, X.TSM.

(d) Compute the spectral density of the filtered process

$$Y_t = \tfrac{1}{3}\left(X_{t-1} + X_t + X_{t+1}\right)$$

and compare the numerical values of the spectral densities of $\{X_t\}$ and $\{Y_t\}$ at frequency $\omega = 2\pi/3$ radians per unit time. What effect would you expect the filter to have on the oscillations of $\{X_t\}$?

(e) Open the project X.TSM and use the option Smooth>Moving Ave. to apply the filter of part (d) to the realization generated in part (c). Comment on the result.


4.9  The spectral density of a real-valued time series $\{X_t\}$ is defined on $[0, \pi]$ by

$$f(\lambda) = \begin{cases} 100, & \text{if } \pi/6 - 0.01 < \lambda < \pi/6 + 0.01, \\ 0, & \text{otherwise,} \end{cases}$$

and on $[-\pi, 0]$ by $f(\lambda) = f(-\lambda)$.

(a) Evaluate the ACVF of $\{X_t\}$ at lags 0 and 1.

(b) Find the spectral density of the process $\{Y_t\}$ defined by

$$Y_t := \nabla_{12} X_t = X_t - X_{t-12}.$$

(c) What is the variance of $Y_t$?

(d) Sketch the power transfer function of the filter $\nabla_{12}$ and use the sketch to explain the effect of the filter on sinusoids with frequencies (i) near zero and (ii) near $\pi/6$.


4.10  Suppose that $\{X_t\}$ is the noncausal and noninvertible ARMA(1,1) process satisfying

$$X_t - \phi X_{t-1} = Z_t + \theta Z_{t-1}, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2),$$

where $|\phi| > 1$ and $|\theta| > 1$. Define $\tilde\phi(B) = 1 - \frac{1}{\phi}B$ and $\tilde\theta(B) = 1 + \frac{1}{\theta}B$ and let $\{W_t\}$ be the process given by

$$W_t := \tilde\theta^{-1}(B)\tilde\phi(B)X_t.$$

(a) Show that $\{W_t\}$ has a constant spectral density function.

(b) Conclude that $\{W_t\} \sim \mathrm{WN}(0, \sigma_w^2)$. Give an explicit formula for $\sigma_w^2$ in terms of $\phi$, $\theta$, and $\sigma^2$.

(c) Deduce that $\tilde\phi(B)X_t = \tilde\theta(B)W_t$, so that $\{X_t\}$ is a causal and invertible ARMA(1,1) process relative to the white noise sequence $\{W_t\}$.


Modeling  and  Forecasting 
with  ARMA  Processes 


5.1  Preliminary  Estimation 

5.2  Maximum  Likelihood  Estimation 

5.3  Diagnostic  Checking 

5.4  Forecasting 

5.5  Order  Selection 


The determination of an appropriate ARMA($p, q$) model to represent an observed stationary time series involves a number of interrelated problems. These include the choice of $p$ and $q$ (order selection) and estimation of the mean, the coefficients $\{\phi_i, i = 1, \ldots, p\}$, $\{\theta_i, i = 1, \ldots, q\}$, and the white noise variance $\sigma^2$. Final selection of the model depends on a variety of goodness of fit tests, although it can be systematized to a large degree by use of criteria such as minimization of the AICC statistic as discussed in Section 5.5. (A useful option in the program ITSM is Model>Estimation>Autofit, which automatically minimizes the AICC statistic over all ARMA($p, q$) processes with $p$ and $q$ in a specified range.)

This chapter is primarily devoted to the problem of estimating the parameters $\phi = (\phi_1, \ldots, \phi_p)$, $\theta = (\theta_1, \ldots, \theta_q)$, and $\sigma^2$ when $p$ and $q$ are assumed to be known, but the crucial issue of order selection is also considered. It will be assumed throughout (unless the mean is believed a priori to be zero) that the data have been "mean-corrected" by subtraction of the sample mean, so that it is appropriate to fit a zero-mean ARMA model to the adjusted data $x_1, \ldots, x_n$. If the model fitted to the mean-corrected data is

$$\phi(B)X_t = \theta(B)Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2),$$

then the corresponding model for the original stationary series $\{Y_t\}$ is found on replacing $X_t$ for each $t$ by $Y_t - \bar{y}$, where $\bar{y} = n^{-1}\sum_{j=1}^{n} y_j$ is the sample mean of the original data, treated as a fixed constant.

When $p$ and $q$ are known, good estimators of $\phi$ and $\theta$ can be found by imagining the data to be observations of a stationary Gaussian time series and maximizing the likelihood with respect to the $p + q + 1$ parameters $\phi_1, \ldots, \phi_p$, $\theta_1, \ldots, \theta_q$ and $\sigma^2$. The estimators obtained by this procedure are known as maximum likelihood (or maximum Gaussian likelihood) estimators. Maximum likelihood estimation is discussed in Section 5.2 and can be carried out in practice using the ITSM option Model>Estimation>Max likelihood, after first specifying a preliminary model to initialize the maximization algorithm. Maximization of the likelihood and selection of the minimum AICC model over a specified range of $p$ and $q$ values can also be carried out using the option Model>Estimation>Autofit.

The  maximization  is  nonlinear  in  the  sense  that  the  function  to  be  maximized  is  not 
a  quadratic  function  of  the  unknown  parameters,  so  the  estimators  cannot  be  found  by 
solving  a  system  of  linear  equations.  They  are  found  instead  by  searching  numerically 
for  the  maximum  of  the  likelihood  surface.  The  algorithm  used  in  ITSM  requires  the 
specification  of  initial  parameter  values  with  which  to  begin  the  search.  The  closer  the 
preliminary  estimates  are  to  the  maximum  likelihood  estimates,  the  faster  the  search 
will  generally  be. 

To  provide  these  initial  values,  a  number  of  preliminary  estimation  algorithms 
are  available  in  the  option  Model>Estimation>Preliminary  of  ITSM.  They 
are  described  in  Section  5.1.  For  pure  autoregressive  models  the  choice  is  between 
Yule-Walker  and  Burg  estimation,  while  for  models  with  q  >  0  it  is  between  the 
innovations  and  Hannan-Rissanen  algorithms.  It  is  also  possible  to  begin  the  search 
with  an  arbitrary  causal  ARMA  model  by  using  the  option  Model>Specify  and 
entering the desired parameter values. The initial values are chosen automatically in
the option Model>Estimation>Autofit.

Calculation  of  the  exact  Gaussian  likelihood  for  an  ARMA  model  (and  in  fact  for 
any  second-order  model)  is  greatly  simplified  by  use  of  the  innovations  algorithm.  In 
Section  5.2  we  take  advantage  of  this  simplification  in  discussing  maximum  likelihood 
estimation  and  consider  also  the  construction  of  confidence  intervals  for  the  estimated 
coefficients. 

Section  5.3  deals  with  goodness  of  fit  tests  for  the  chosen  model  and  Section  5.4 
with  the  use  of  the  fitted  model  for  forecasting.  In  Section  5.5  we  discuss  the  theoretical 
basis  for  some  of  the  criteria  used  for  order  selection. 

For  an  overview  of  the  general  strategy  for  model-fitting  see  Section  6.2. 


5.1  Preliminary  Estimation 

In this section we shall consider four techniques for preliminary estimation of the parameters $\phi = (\phi_1, \ldots, \phi_p)'$, $\theta = (\theta_1, \ldots, \theta_q)'$, and $\sigma^2$ from observations $x_1, \ldots, x_n$ of the causal ARMA($p, q$) process defined by

$$\phi(B)X_t = \theta(B)Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2). \tag{5.1.1}$$

The Yule-Walker and Burg procedures apply to the fitting of pure autoregressive models. (Although the former can be adapted to models with $q > 0$, its performance is less efficient than when $q = 0$.) The innovations and Hannan-Rissanen algorithms are used in ITSM to provide preliminary estimates of the ARMA parameters when $q > 0$.

For pure autoregressive models Burg's algorithm usually gives higher likelihoods
than the Yule-Walker equations. For pure moving-average models the innovations
algorithm  frequently  gives  slightly  higher  likelihoods  than  the  Hannan-Rissanen 
algorithm  (we  use  only  the  first  two  steps  of  the  latter  for  preliminary  estimation).  For 
mixed  models  (i.e.,  those  with  p  >  0  and  q  >  0)  the  Hannan-Rissanen  algorithm  is 
usually  more  successful  in  finding  causal  models  (which  are  required  for  initialization 
of  the  likelihood  maximization). 


5.1.1  Yule-Walker Estimation


For a pure autoregressive model the moving-average polynomial $\theta(z)$ is identically 1, and the causality assumption in (5.1.1) allows us to write $X_t$ in the form

$$X_t = \sum_{j=0}^{\infty} \psi_j Z_{t-j}, \tag{5.1.2}$$

where, from Section 3.1, $\psi(z) = \sum_{j=0}^{\infty} \psi_j z^j = 1/\phi(z)$. Multiplying each side of (5.1.1) by $X_{t-j}$, $j = 0, 1, 2, \ldots, p$, taking expectations, and using (5.1.2) to evaluate the right-hand side of the first equation, we obtain the Yule-Walker equations

$$\Gamma_p \phi = \gamma_p \tag{5.1.3}$$

and

$$\sigma^2 = \gamma(0) - \phi'\gamma_p, \tag{5.1.4}$$

where $\Gamma_p$ is the covariance matrix $[\gamma(i - j)]_{i,j=1}^{p}$ and $\gamma_p = (\gamma(1), \ldots, \gamma(p))'$. These equations can be used to determine $\gamma(0), \ldots, \gamma(p)$ from $\sigma^2$ and $\phi$.

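On the other hand, if we replace the covariances $\gamma(j)$, $j = 0, \ldots, p$, appearing in (5.1.3) and (5.1.4) by the corresponding sample covariances $\hat\gamma(j)$, we obtain a set of equations for the so-called Yule-Walker estimators $\hat\phi$ and $\hat\sigma^2$ of $\phi$ and $\sigma^2$, namely,

$$\hat\Gamma_p \hat\phi = \hat\gamma_p \tag{5.1.5}$$

and

$$\hat\sigma^2 = \hat\gamma(0) - \hat\phi'\hat\gamma_p, \tag{5.1.6}$$

where $\hat\Gamma_p = [\hat\gamma(i - j)]_{i,j=1}^{p}$ and $\hat\gamma_p = (\hat\gamma(1), \ldots, \hat\gamma(p))'$.

If $\hat\gamma(0) > 0$, then $\hat\Gamma_m$ is nonsingular for every $m = 1, 2, \ldots$ (see Brockwell and Davis (1991), Problem 7.11), so we can rewrite equations (5.1.5) and (5.1.6) in the following form:

Sample Yule-Walker Equations:

$$\hat\phi = \left(\hat\phi_1, \ldots, \hat\phi_p\right)' = \hat R_p^{-1}\hat\rho_p \tag{5.1.7}$$

and

$$\hat\sigma^2 = \hat\gamma(0)\left[1 - \hat\rho_p'\hat R_p^{-1}\hat\rho_p\right], \tag{5.1.8}$$

where $\hat\rho_p = (\hat\rho(1), \ldots, \hat\rho(p))' = \hat\gamma_p/\hat\gamma(0)$.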
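The sample Yule-Walker equations amount to one linear solve. The sketch below is an illustration, not ITSM's implementation: it assumes NumPy, takes $\hat R_p$ to be the Toeplitz matrix of sample autocorrelations, and uses the divisor-$n$ sample autocovariance (an assumption, since the definition is not restated in this section). The function names are hypothetical.

```python
import numpy as np

def sample_acvf(x, max_lag):
    """Sample autocovariances gamma_hat(0..max_lag), with divisor n and mean correction."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    d = x - x.mean()
    return np.array([np.dot(d[: n - h], d[h:]) / n for h in range(max_lag + 1)])

def yule_walker(gamma, p):
    """Solve (5.1.7)-(5.1.8) from autocovariances gamma[0..p]; returns (phi_hat, sigma2_hat)."""
    gamma = np.asarray(gamma, dtype=float)
    rho = gamma[1 : p + 1] / gamma[0]
    lags = np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    R = (gamma / gamma[0])[lags]              # Toeplitz autocorrelation matrix R_p
    phi = np.linalg.solve(R, rho)
    sigma2 = gamma[0] * (1.0 - rho @ phi)     # rho' R^{-1} rho equals rho' phi
    return phi, sigma2

# Check on the exact autocovariances of an AR(1) with phi = 0.5, sigma^2 = 1,
# for which gamma(h) = 0.5**h / (1 - 0.25). Fitting p = 2 should give (0.5, 0).
gamma_ar1 = np.array([0.5 ** h for h in range(3)]) / 0.75
phi_hat, sigma2_hat = yule_walker(gamma_ar1, p=2)
```

In practice one would call `yule_walker(sample_acvf(x, p), p)` on mean-corrected or raw data `x`; the exact-autocovariance check above only confirms that the algebra is consistent.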

With $\hat\phi$ as defined by (5.1.7), it can be shown that $1 - \hat\phi_1 z - \cdots - \hat\phi_p z^p \ne 0$ for $|z| \le 1$ (see Brockwell and Davis (1991), Problem 8.3). Hence the fitted model

$$X_t - \hat\phi_1 X_{t-1} - \cdots - \hat\phi_p X_{t-p} = Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \hat\sigma^2)$$

is causal. The autocovariances $\gamma_F(h)$, $h = 0, \ldots, p$, of the fitted model therefore satisfy the $p + 1$ linear equations

$$\gamma_F(h) - \hat\phi_1\gamma_F(h - 1) - \cdots - \hat\phi_p\gamma_F(h - p) = \begin{cases} 0, & h = 1, \ldots, p, \\ \hat\sigma^2, & h = 0. \end{cases}$$

However, from (5.1.5) and (5.1.6) we see that the solution of these equations is $\gamma_F(h) = \hat\gamma(h)$, $h = 0, \ldots, p$, so that the autocovariances of the fitted model at lags $0, 1, \ldots, p$ coincide with the corresponding sample autocovariances.

The argument of the preceding paragraph shows that for every nonsingular covariance matrix of the form $\Gamma_{p+1} = [\gamma(i - j)]_{i,j=1}^{p+1}$ there is an AR($p$) process whose autocovariances at lags $0, \ldots, p$ are $\gamma(0), \ldots, \gamma(p)$. (The required coefficients and white noise variance are found from (5.1.7) and (5.1.8) on replacing $\hat\rho(j)$ by $\gamma(j)/\gamma(0)$, $j = 0, \ldots, p$, and $\hat\gamma(0)$ by $\gamma(0)$.) There may not, however, be an MA($p$) process with this property. For example, if $\gamma(0) = 1$ and $\gamma(1) = \gamma(-1) = \beta$, the matrix $\Gamma_2$ is a nonsingular covariance matrix for all $\beta \in (-1, 1)$. Consequently, there is an AR(1) process with autocovariances 1 and $\beta$ at lags 0 and 1 for all $\beta \in (-1, 1)$. However, there is an MA(1) process with autocovariances 1 and $\beta$ at lags 0 and 1 if and only if $|\beta| \le \frac{1}{2}$. (See Example 2.1.1.)
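The MA(1) restriction can be verified directly. The following is a short derivation sketch, assuming the standard MA(1) parameterization $X_t = Z_t + \theta Z_{t-1}$, $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$, which is not restated in the paragraph above. Matching the two autocovariances gives

$$\gamma(0) = \sigma^2(1 + \theta^2) = 1, \qquad \gamma(1) = \sigma^2\theta = \beta \quad\Longrightarrow\quad \beta\theta^2 - \theta + \beta = 0.$$

For $\beta \ne 0$ this quadratic in $\theta$ has a real root if and only if its discriminant $1 - 4\beta^2$ is nonnegative, i.e., if and only if $|\beta| \le \frac{1}{2}$; for $\beta = 0$ the choice $\theta = 0$, $\sigma^2 = 1$ works.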

It is often the case that moment estimators, i.e., estimators that (like $\hat\phi$) are obtained by equating theoretical and sample moments, have much higher variances than estimators obtained by alternative methods such as maximum likelihood. However, the Yule-Walker estimators of the coefficients $\phi_1, \ldots, \phi_p$ of an AR($p$) process have approximately the same distribution for large samples as the corresponding maximum likelihood estimators. For a precise statement of this result see Brockwell and Davis (1991), Section 8.10. For our purposes it suffices to note the following:

Large-Sample Distribution of Yule-Walker Estimators:

For a large sample from an AR($p$) process,

$$\hat\phi \approx \mathrm{N}\left(\phi, n^{-1}\sigma^2\Gamma_p^{-1}\right).$$

If we replace $\sigma^2$ and $\Gamma_p$ by their estimates $\hat\sigma^2$ and $\hat\Gamma_p$, we can use this result to find large-sample confidence regions for $\phi$ and each of its components as in (5.1.12) and (5.1.13) below.


Order Selection

In practice we do not know the true order of the model generating the data. In fact, it will usually be the case that there is no true AR model, in which case our goal is simply to find one that represents the data optimally in some sense. Two useful techniques for selecting an appropriate AR model are given below. The second is more systematic and extends beyond the narrow class of pure autoregressive models.

•  Some guidance in the choice of order is provided by a large-sample result (see Brockwell and Davis (1991), Section 8.10), which states that if $\{X_t\}$ is the causal AR($p$) process defined by (5.1.1) with $\{Z_t\} \sim \mathrm{iid}(0, \sigma^2)$ and if we fit a model with order $m > p$ using the Yule-Walker equations, i.e., if we fit a model with coefficient vector

$$\hat\phi_m = \hat R_m^{-1}\hat\rho_m, \quad m > p,$$

then the last component, $\hat\phi_{mm}$, of the vector $\hat\phi_m$ is approximately normally distributed with mean 0 and variance $1/n$. Notice that $\hat\phi_{mm}$ is exactly the sample partial autocorrelation at lag $m$ as defined in Section 3.2.3.

Now, we already know from Example 3.2.6 that for an AR($p$) process the partial autocorrelations $\phi_{mm}$, $m > p$, are zero. By the result of the previous paragraph, if an AR($p$) model is appropriate for the data, then the values $\hat\phi_{kk}$, $k > p$, should be compatible with observations from the distribution $\mathrm{N}(0, 1/n)$. In particular, for $k > p$, $\hat\phi_{kk}$ will fall between the bounds $\pm 1.96n^{-1/2}$ with probability close to 0.95. This suggests using as a preliminary estimator of $p$ the smallest value $m$ such that $|\hat\phi_{kk}| < 1.96n^{-1/2}$ for $k > m$.

The program ITSM plots the sample PACF $\{\hat\phi_{mm}, m = 1, 2, \ldots\}$ together with the bounds $\pm 1.96/\sqrt{n}$. From this graph it is easy to read off the preliminary estimator of $p$ defined above.


•  A more systematic approach to order selection is to find the values of $p$ and $\phi_p$ that minimize the AICC statistic (see Section 5.5.2 below)

$$\mathrm{AICC} = -2\ln L\left(\phi_p, S(\phi_p)/n\right) + 2(p + 1)n/(n - p - 2),$$

where $L$ is the Gaussian likelihood defined in (5.2.9) and $S$ is defined in (5.2.11). The Preliminary Estimation dialog box of ITSM (opened by pressing the blue PRE button) allows you to search for the minimum AICC Yule-Walker (or Burg) models by checking Find AR model with min AICC. This causes the program to fit autoregressions of orders $0, 1, \ldots, 27$ and to return the model with smallest AICC value.
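The first of the two techniques above, reading the order off the sample PACF, can be expressed as a small function. This is a sketch only: the name `preliminary_ar_order` is hypothetical, and the usage lines rely on the two sample PACF values and the sample size $n = 77$ quoted in Example 5.1.1, so only lags 1 and 2 are examined.

```python
import math

def preliminary_ar_order(pacf, n):
    """Smallest m with |pacf[k-1]| < 1.96/sqrt(n) for every available lag k > m.

    pacf -- sample partial autocorrelations (phi_hat_11, phi_hat_22, ...)
    n    -- number of observations
    """
    bound = 1.96 / math.sqrt(n)
    m = len(pacf)
    # Walk back from the largest lag while the PACF stays inside the bounds.
    while m > 0 and abs(pacf[m - 1]) < bound:
        m -= 1
    return m

# Values quoted in Example 5.1.1 (differenced Dow Jones series, n = 77).
order = preliminary_ar_order([0.4219, 0.1138], n=77)
```

With more lags available, the same function would examine all of them; a single value slightly outside the bounds at a high lag would push the estimate up, which is one reason the AICC-based approach is described as more systematic.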


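Definition 5.1.1

The fitted Yule-Walker AR($m$) model is

$$X_t - \hat\phi_{m1}X_{t-1} - \cdots - \hat\phi_{mm}X_{t-m} = Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \hat v_m), \tag{5.1.9}$$

where

$$\hat\phi_m = \left(\hat\phi_{m1}, \ldots, \hat\phi_{mm}\right)' = \hat R_m^{-1}\hat\rho_m \tag{5.1.10}$$

and

$$\hat v_m = \hat\gamma(0)\left[1 - \hat\rho_m'\hat R_m^{-1}\hat\rho_m\right]. \tag{5.1.11}$$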

For  both  approaches  to  order  selection  we  need  to  fit  AR  models  of  gradually 
increasing order to our given data. The problem of solving the Yule-Walker equations
with  gradually  increasing  orders  has  already  been  encountered  in  a  slightly  different 
context  in  Section  2.5.3,  where  we  derived  a  recursive  scheme  for  solving  the 
equations  (5.1.3)  and  (5.1.4)  with  p  successively  taking  the  values  1,2,....  Here  we 
can  use  exactly  the  same  scheme  (the  Durbin-Levinson  algorithm)  to  solve  the  Yule- 
Walker  equations  (5.1.5)  and  (5.1.6),  the  only  difference  being  that  the  covariances 
in  (5.1.3)  and  (5.1.4)  are  replaced  by  their  sample  counterparts.  This  is  the  algorithm 
used  by  ITSM  to  perform  the  necessary  calculations. 

Confidence Regions for the Coefficients

Under the assumption that the order $p$ of the fitted model is the correct value, we can use the asymptotic distribution of $\hat\phi_p$ to derive approximate large-sample confidence regions for the true coefficient vector $\phi_p$ and for its individual components $\phi_{pj}$. Thus, if $\chi^2_{1-\alpha}(p)$ denotes the $(1 - \alpha)$ quantile of the chi-squared distribution with $p$ degrees of freedom, then for large sample-size $n$ the region

$$\left\{\phi \in \mathbb{R}^p : \left(\hat\phi_p - \phi\right)'\hat\Gamma_p\left(\hat\phi_p - \phi\right) \le n^{-1}\hat v_p\chi^2_{1-\alpha}(p)\right\} \tag{5.1.12}$$

contains $\phi_p$ with probability close to $(1 - \alpha)$. (This follows from Problem A.7 and the fact that $\sqrt{n}(\hat\phi_p - \phi_p)$ is approximately normally distributed with mean 0 and covariance matrix $\hat v_p\hat\Gamma_p^{-1}$.) Similarly, if $\Phi_{1-\alpha}$ denotes the $(1 - \alpha)$ quantile of the standard normal distribution and $\hat v_{jj}$ is the $j$th diagonal element of $\hat v_p\hat\Gamma_p^{-1}$, then for large $n$ the interval bounded by

$$\hat\phi_{pj} \pm \Phi_{1-\alpha/2}\, n^{-1/2}\, \hat v_{jj}^{1/2} \tag{5.1.13}$$

contains $\phi_{pj}$ with probability close to $(1 - \alpha)$.

Example 5.1.1  The Dow Jones Utilities Index, Aug. 28-Dec. 18, 1972; DOWJ.TSM

The very slowly decaying positive sample ACF of the time series contained in the file DOWJ.TSM suggests differencing at lag 1 before attempting to fit a stationary model. One application of the operator $(1 - B)$ produces a new series $\{Y_t\}$ with no obvious deviations from stationarity. We shall therefore try fitting an AR process to this new series

$$Y_t = D_t - D_{t-1}$$

using the Yule-Walker equations. There are 77 values of $Y_t$, which we shall denote by $Y_1, \ldots, Y_{77}$. (We ignore the unequal spacing of the original data resulting from the five-day working week.) The sample autocovariances of the series $y_1, \ldots, y_{77}$ are $\hat\gamma(0) = 0.17992$, $\hat\gamma(1) = 0.07590$, $\hat\gamma(2) = 0.04885$, etc.

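Applying the Durbin-Levinson algorithm to fit successively higher-order autoregressive processes to the data, we obtain

$$\begin{aligned}
\hat\phi_{11} &= \hat\rho(1) = 0.4219, \\
\hat v_1 &= \hat\gamma(0)\left[1 - \hat\rho^2(1)\right] = 0.1479, \\
\hat\phi_{22} &= \left[\hat\gamma(2) - \hat\phi_{11}\hat\gamma(1)\right]/\hat v_1 = 0.1138, \\
\hat\phi_{21} &= \hat\phi_{11} - \hat\phi_{11}\hat\phi_{22} = 0.3739, \\
\hat v_2 &= \hat v_1\left[1 - \hat\phi_{22}^2\right] = 0.1460.
\end{aligned}$$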
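These recursions can be reproduced from the three quoted sample autocovariances. The following is a sketch of the Durbin-Levinson recursion written for this purpose, assuming NumPy; the function name `durbin_levinson` and its return format are illustrative, and this is not a description of ITSM's internal code.

```python
import numpy as np

def durbin_levinson(gamma, max_order):
    """Fit AR(1), ..., AR(max_order) recursively from autocovariances gamma[0..max_order].

    Returns a list of (phi_m, v_m) pairs, where phi_m = (phi_m1, ..., phi_mm)
    and v_m is the corresponding white noise variance estimate.
    """
    gamma = np.asarray(gamma, dtype=float)
    fits = []
    phi_prev = np.array([])
    v = gamma[0]
    for m in range(1, max_order + 1):
        # phi_mm = [gamma(m) - sum_{j=1}^{m-1} phi_{m-1,j} gamma(m-j)] / v_{m-1}
        phi_mm = (gamma[m] - phi_prev @ gamma[m - 1 : 0 : -1]) / v
        # phi_{m,j} = phi_{m-1,j} - phi_mm * phi_{m-1,m-j}, j = 1, ..., m-1
        phi = np.append(phi_prev - phi_mm * phi_prev[::-1], phi_mm)
        v = v * (1.0 - phi_mm ** 2)
        fits.append((phi, v))
        phi_prev = phi
    return fits

# Sample autocovariances quoted in the text for the differenced Dow Jones series.
gamma_hat = [0.17992, 0.07590, 0.04885]
(phi_1, v_1), (phi_2, v_2) = durbin_levinson(gamma_hat, max_order=2)
```

Up to rounding in the last digit, `phi_1`, `v_1`, `phi_2`, and `v_2` should agree with the values listed above; small differences arise because the text rounds intermediate quantities to four decimals.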

The sample ACF and PACF of the data can be displayed by pressing the second yellow button at the top of the ITSM window. They are shown in Figures 5-1 and 5-2, respectively. Also plotted are the bounds $\pm 1.96/\sqrt{77}$. Since the PACF values at lags greater than 1 all lie between the bounds, the first order-selection criterion described above indicates that we should fit an AR(1) model to the data set $\{Y_t\}$. Unless we wish to assume that $\{Y_t\}$ is a zero-mean process, we should subtract the sample mean from the data before attempting to fit a (zero-mean) AR(1) model. When the blue PRE (preliminary estimation) button at the top of the ITSM window is pressed, you will be given the option of subtracting the mean from the data. In this case (as in most) click Yes to obtain the new series

$$X_t = Y_t - 0.1336.$$

You will then see the Preliminary Estimation dialog box. Enter 1 for the AR order, zero for the MA order, select Yule-Walker, and click OK. We have already computed $\hat\phi_{11}$ and $\hat v_1$ above using the Durbin-Levinson algorithm. The Yule-Walker AR(1) model obtained by ITSM for $\{X_t\}$ is therefore (not surprisingly)

$$X_t - 0.4219X_{t-1} = Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, 0.1479), \tag{5.1.14}$$

and the corresponding model for $\{Y_t\}$ is

$$Y_t - 0.1336 - 0.4219(Y_{t-1} - 0.1336) = Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, 0.1479). \tag{5.1.15}$$

[Figure 5-1: The sample ACF of the differenced series $\{Y_t\}$ in Example 5.1.1, plotted against lag.]

[Figure 5-2: The sample PACF of the differenced series $\{Y_t\}$ in Example 5.1.1.]

Assuming that our observed data really are generated by an AR process with $p = 1$, (5.1.13) gives us approximate 95% confidence bounds for the autoregressive coefficient $\phi$,

$$0.4219 \pm \frac{(1.96)(0.1479)^{1/2}}{(0.17992)^{1/2}\sqrt{77}} = (0.2194, 0.6244).$$

Besides estimating the autoregressive coefficients, ITSM computes and prints out the ratio of each coefficient to 1.96 times its estimated standard deviation. From these numbers large-sample 95% confidence intervals for each of the coefficients are easily obtained. In this particular example there is just one coefficient estimate, $\hat\phi_1 = 0.4219$, with ratio of coefficient to $1.96 \times$ standard error equal to 2.0832. Hence the required 95% confidence bounds are $0.4219 \pm 0.4219/2.0832 = (0.2194, 0.6244)$, as found
above. 
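
The conversion from the printed ratio to confidence bounds can be sketched in a few lines of Python. This is only an illustration of the arithmetic above; the function name `bounds_from_ratio` is hypothetical and is not part of ITSM.

```python
def bounds_from_ratio(coef, ratio):
    """Return (lower, upper) 95% bounds for a coefficient, given the ratio
    coef / (1.96 * standard error) of the kind printed by ITSM."""
    half_width = abs(coef / ratio)  # this equals 1.96 * standard error
    return coef - half_width, coef + half_width


# Values taken from Example 5.1.1.
lower, upper = bounds_from_ratio(0.4219, 2.0832)
print(f"({lower:.4f}, {upper:.4f})")
```

A ratio larger than 1 in absolute value therefore corresponds to bounds that exclude zero.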

A useful technique for preliminary autoregressive estimation that incorporates automatic model selection (i.e., choice of $p$) is to minimize the AICC [see equation (5.5.4)] over all fitted autoregressions of orders 0 through 27. This is achieved by selecting both Yule-Walker and Find AR model with min AICC in the Preliminary Estimation dialog box. (The MA order must be set to zero, but the AR order setting is immaterial.) Click OK, and the program will search through all the Yule-Walker AR($p$) models, $p = 0, 1, \ldots, 27$, selecting the one with smallest AICC value. The minimum-AICC Yule-Walker AR model turns out to be the one defined by (5.1.14) with $p = 1$ and AICC value 74.541.


□ 


Yule-Walker  Estimation  with  q  >  0;  Moment  Estimators 

The Yule-Walker estimates for the parameters in an AR($p$) model are examples of moment estimators: The autocovariances at lags $0, 1, \ldots, p$ are replaced by the corresponding sample estimates in the Yule-Walker equations (5.1.3), which are then solved for the parameters $\boldsymbol\phi = (\phi_1, \ldots, \phi_p)'$ and $\sigma^2$. The analogous procedure for ARMA($p$, $q$) models with $q > 0$ is easily formulated, but the corresponding equations are nonlinear in the unknown coefficients, leading to possible nonexistence and nonuniqueness of solutions for the required estimators.

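From (3.2.5), the equations to be solved for $\phi_1, \ldots, \phi_p$, $\theta_1, \ldots, \theta_q$ and $\sigma^2$ are

$$\hat\gamma(k) - \phi_1\hat\gamma(k-1) - \cdots - \phi_p\hat\gamma(k-p) = \sigma^2 \sum_{j=k}^{q} \theta_j \psi_{j-k}, \quad 0 \le k \le p + q, \tag{5.1.16}$$

where $\psi_j$ must first be expressed in terms of $\boldsymbol\phi$ and $\boldsymbol\theta$ using the identity $\psi(z) = \theta(z)/\phi(z)$ ($\theta_0 := 1$ and $\theta_j = \psi_j = 0$ for $j < 0$).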


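For the MA(1) model the equations (5.1.16) are equivalent to

$$\hat\gamma(0) = \sigma^2\left(1 + \theta_1^2\right), \tag{5.1.17}$$

$$\hat\rho(1) = \frac{\theta_1}{1 + \theta_1^2}. \tag{5.1.18}$$

If $|\hat\rho(1)| > 0.5$, there is no real solution, so we define $\hat\theta_1 = \hat\rho(1)/|\hat\rho(1)|$. If $|\hat\rho(1)| \le 0.5$, then the solution of (5.1.17)-(5.1.18) (with $|\hat\theta_1| \le 1$) is

$$\hat\theta_1 = \left(1 - \left(1 - 4\hat\rho^2(1)\right)^{1/2}\right) \big/ \left(2\hat\rho(1)\right),$$

$$\hat\sigma^2 = \hat\gamma(0) \big/ \left(1 + \hat\theta_1^2\right).$$

For the overshort data of Example 3.2.8, $\hat\rho(1) = -0.5035$ and $\hat\gamma(0) = 3416$, so the fitted MA(1) model has parameters $\hat\theta_1 = -1.0$ and $\hat\sigma^2 = 1708$.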

□ 
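
The MA(1) moment estimator described in this example can be sketched as follows. This is a minimal illustration of (5.1.17)-(5.1.18) and of the convention used when no real solution exists; the function name `ma1_moment_estimate` is hypothetical, and the handling of $\hat\rho(1) = 0$ (returning $\hat\theta_1 = 0$, the limiting value) is an added assumption.

```python
import math


def ma1_moment_estimate(rho1, gamma0):
    """Moment estimates (theta1, sigma2) for an MA(1) model, computed from the
    lag-1 sample autocorrelation rho1 and lag-0 sample autocovariance gamma0."""
    if abs(rho1) > 0.5:
        # No real solution of (5.1.18): use rho1 / |rho1|.
        theta1 = math.copysign(1.0, rho1)
    elif rho1 == 0.0:
        # Assumed limiting value of the root as rho1 -> 0.
        theta1 = 0.0
    else:
        # Root of rho1 * theta**2 - theta + rho1 = 0 with |theta| <= 1.
        theta1 = (1.0 - math.sqrt(1.0 - 4.0 * rho1 ** 2)) / (2.0 * rho1)
    sigma2 = gamma0 / (1.0 + theta1 ** 2)
    return theta1, sigma2


# Overshort data: rho1 = -0.5035 and gamma0 = 3416.
print(ma1_moment_estimate(-0.5035, 3416.0))
```

For the overshort data the first branch applies, which reproduces the values $-1.0$ and $1708$ quoted in the example.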
Relative Efficiency of Estimators

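The performance of two competing estimators is often measured by computing their asymptotic relative efficiency. In a general statistics estimation problem, suppose $\hat\theta_n^{(1)}$ and $\hat\theta_n^{(2)}$ are two estimates of the parameter $\theta$ in the parameter space $\Theta$ based on the observations $X_1, \ldots, X_n$. If $\hat\theta_n^{(i)}$ is approximately $\mathrm{N}\left(\theta, \sigma_i^2(\theta)\right)$ for large $n$, $i = 1, 2$, then the asymptotic efficiency of $\hat\theta_n^{(1)}$ relative to $\hat\theta_n^{(2)}$ is defined to be

$$e\left(\theta, \hat\theta^{(1)}, \hat\theta^{(2)}\right) = \frac{\sigma_2^2(\theta)}{\sigma_1^2(\theta)}.$$

If $e\left(\theta, \hat\theta^{(1)}, \hat\theta^{(2)}\right) \le 1$ for all $\theta \in \Theta$, then we say that $\hat\theta_n^{(2)}$ is a more efficient estimator of $\theta$ than $\hat\theta_n^{(1)}$ (strictly more efficient if in addition, $e\left(\theta, \hat\theta^{(1)}, \hat\theta^{(2)}\right) < 1$ for some $\theta \in \Theta$). For the MA(1) process the moment estimator $\hat\theta_n^{(1)}$ discussed in Example 5.1.2 is approximately $\mathrm{N}\left(\theta_1, \sigma_1^2(\theta_1)/n\right)$ with

$$\sigma_1^2(\theta_1) = \left(1 + \theta_1^2 + 4\theta_1^4 + \theta_1^6 + \theta_1^8\right) \big/ \left(1 - \theta_1^2\right)^2$$

(see Brockwell and Davis (1991), p. 254). On the other hand, the innovations estimator $\hat\theta_n^{(2)}$ discussed in the next section is distributed approximately as $\mathrm{N}\left(\theta_1, n^{-1}\right)$. Thus, $e\left(\theta_1, \hat\theta^{(1)}, \hat\theta^{(2)}\right) = \sigma_1^{-2}(\theta_1) \le 1$ for all $|\theta_1| < 1$, with strict inequality when $\theta_1 \ne 0$. In particular,

$$e\left(\theta_1, \hat\theta^{(1)}, \hat\theta^{(2)}\right) = \begin{cases} 0.82, & \theta_1 = 0.25, \\ 0.37, & \theta_1 = 0.50, \\ 0.06, & \theta_1 = 0.75, \end{cases}$$

demonstrating the superiority, at least in terms of asymptotic relative efficiency, of $\hat\theta_n^{(2)}$ over $\hat\theta_n^{(1)}$. On the other hand (Section 5.2), the maximum likelihood estimator $\hat\theta_n^{(3)}$ of $\theta_1$ is approximately $\mathrm{N}\left(\theta_1, (1 - \theta_1^2)/n\right)$. Hence,

$$e\left(\theta_1, \hat\theta^{(2)}, \hat\theta^{(3)}\right) = \begin{cases} 0.94, & \theta_1 = 0.25, \\ 0.75, & \theta_1 = 0.50, \\ 0.44, & \theta_1 = 0.75. \end{cases}$$

While $\hat\theta_n^{(3)}$ is more efficient, $\hat\theta_n^{(2)}$ has reasonably good efficiency, except when $|\theta_1|$ is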
close  to  1,  and  can  serve  as  initial  value  for  the  nonlinear  optimization  procedure  in 
computing  the  maximum  likelihood  estimator. 
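
The tabulated efficiencies follow directly from the three asymptotic variances quoted above. The short Python sketch below recomputes them; the function names are hypothetical and serve only to label the three variance formulas.

```python
def var_moment(theta):
    """n times the asymptotic variance of the MA(1) moment estimator."""
    t2 = theta ** 2
    return (1 + t2 + 4 * t2 ** 2 + t2 ** 3 + t2 ** 4) / (1 - t2) ** 2


def var_innovations(theta):
    """n times the asymptotic variance of the innovations estimator."""
    return 1.0


def var_mle(theta):
    """n times the asymptotic variance of the maximum likelihood estimator."""
    return 1.0 - theta ** 2


for theta in (0.25, 0.50, 0.75):
    e_moment_vs_innov = var_innovations(theta) / var_moment(theta)
    e_innov_vs_mle = var_mle(theta) / var_innovations(theta)
    print(f"theta1={theta:.2f}  e(1,2)={e_moment_vs_innov:.2f}  "
          f"e(2,3)={e_innov_vs_mle:.2f}")
```

At $\theta_1 = 0$ all three variances coincide, which is why the inequalities are strict only away from zero.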

While the method of moments is an effective procedure for fitting autoregressive models, it does not perform as well for ARMA models with $q > 0$. From a computational point of view, it requires as much computing time as the more efficient estimators based on either the innovations algorithm or the Hannan-Rissanen procedure and is therefore rarely used except when $q = 0$.


5.1.2 Burg's Algorithm

The Yule-Walker coefficients $\hat\phi_{p1}, \ldots, \hat\phi_{pp}$ are precisely the coefficients of the best linear predictor of $X_{p+1}$ in terms of $\{X_p, \ldots, X_1\}$ under the assumption that the ACF of $\{X_t\}$ coincides with the sample ACF at lags $1, \ldots, p$.

Burg's algorithm estimates the PACF $\{\phi_{11}, \phi_{22}, \ldots\}$ by successively minimizing sums of squares of forward and backward one-step prediction errors with respect to the coefficients $\phi_{ii}$. Given observations $\{x_1, \ldots, x_n\}$ of a stationary zero-mean time series $\{X_t\}$ we define $u_i(t)$, $t = i + 1, \ldots, n$, $0 \le i < n$, to be the difference between $x_{n+1+i-t}$ and the best linear estimate of $x_{n+1+i-t}$ in terms of the preceding $i$ observations. Similarly, we define $v_i(t)$, $t = i + 1, \ldots, n$, $0 \le i < n$, to be the difference between $x_{n+1-t}$ and the best linear estimate of $x_{n+1-t}$ in terms of the subsequent $i$ observations. Then it can be shown (see Problem 5.6) that the forward and backward prediction errors $\{u_i(t)\}$ and $\{v_i(t)\}$ satisfy the recursions

$$u_0(t) = v_0(t) = x_{n+1-t},$$

$$u_i(t) = u_{i-1}(t-1) - \phi_{ii}v_{i-1}(t), \tag{5.1.19}$$

and

$$v_i(t) = v_{i-1}(t) - \phi_{ii}u_{i-1}(t-1). \tag{5.1.20}$$


Burg's estimate $\phi_{11}^{(B)}$ of $\phi_{11}$ is found by minimizing

$$\sigma_1^2 := \frac{1}{2(n-1)} \sum_{t=2}^{n} \left[u_1^2(t) + v_1^2(t)\right]$$

with respect to $\phi_{11}$. This gives corresponding numerical values for $u_1(t)$ and $v_1(t)$ and $\sigma_1^2$ that can then be substituted into (5.1.19) and (5.1.20) with $i = 2$. Then we minimize

$$\sigma_2^2 := \frac{1}{2(n-2)} \sum_{t=3}^{n} \left[u_2^2(t) + v_2^2(t)\right]$$

with respect to $\phi_{22}$ to obtain the Burg estimate $\phi_{22}^{(B)}$ of $\phi_{22}$ and corresponding values of $u_2(t)$, $v_2(t)$, and $\sigma_2^2$. This process can clearly be continued to obtain estimates $\phi_{pp}^{(B)}$ and corresponding minimum values, $\sigma_p^{(B)2}$, $p \le n - 1$. Estimates of the coefficients $\phi_{pj}$, $1 \le j \le p - 1$, in the best linear predictor

$$P_pX_{p+1} = \phi_{p1}X_p + \cdots + \phi_{pp}X_1$$

are then found by substituting the estimates $\phi_{ii}^{(B)}$, $i = 1, \ldots, p$, for $\phi_{ii}$ in the recursions (2.5.20)-(2.5.22). The resulting estimates of $\phi_{pj}$, $j = 1, \ldots, p$, are the coefficient estimates of the Burg AR($p$) model for the data $\{x_1, \ldots, x_n\}$. The Burg estimate of the white noise variance is the minimum value $\sigma_p^{(B)2}$ found in the determination of $\phi_{pp}^{(B)}$.

The calculation of the estimates of $\phi_{pp}$ and $\sigma_p^2$ described above is equivalent (Problem 5.7) to solving the following recursions:


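Burg's Algorithm:

$$d(1) = \sum_{t=2}^{n} \left(u_0^2(t-1) + v_0^2(t)\right),$$

$$\phi_{ii}^{(B)} = \frac{2}{d(i)} \sum_{t=i+1}^{n} v_{i-1}(t)\,u_{i-1}(t-1),$$

$$d(i+1) = \left(1 - \phi_{ii}^{(B)2}\right)d(i) - v_i^2(i+1) - u_i^2(n),$$

$$\sigma_i^{(B)2} = \left[\left(1 - \phi_{ii}^{(B)2}\right)d(i)\right] \big/ \left[2(n - i)\right].$$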
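
The recursions above translate fairly directly into code. The following Python sketch is one possible implementation under stated assumptions, not the routine used in ITSM: the function name `burg_ar` is hypothetical, NumPy is assumed to be available, the data are assumed to be already mean-corrected, and the coefficients $\phi_{pj}$ are updated with the Durbin-Levinson step $\phi_{ij} = \phi_{i-1,j} - \phi_{ii}\phi_{i-1,i-j}$.

```python
import numpy as np


def burg_ar(x, p):
    """Burg estimates (phi_p1..phi_pp, white noise variance) for an AR(p)
    model fitted to zero-mean data x_1..x_n, with 1 <= p <= n - 1."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # u[t] and v[t] hold u_i(t) and v_i(t) for t = 1..n (index 0 is unused).
    u = np.empty(n + 1)
    v = np.empty(n + 1)
    u[1:] = x[::-1]          # u_0(t) = x_{n+1-t}
    v[1:] = x[::-1]          # v_0(t) = x_{n+1-t}
    # d(1) = sum_{t=2}^{n} (u_0(t-1)^2 + v_0(t)^2)
    d = np.sum(u[1:n] ** 2 + v[2:n + 1] ** 2)
    phi = np.zeros(0)
    sigma2 = None
    for i in range(1, p + 1):
        u_prev = u[i:n]          # u_{i-1}(t-1), t = i+1..n
        v_prev = v[i + 1:n + 1]  # v_{i-1}(t),   t = i+1..n
        phi_ii = 2.0 / d * np.sum(v_prev * u_prev)
        u_new, v_new = u.copy(), v.copy()
        u_new[i + 1:] = u_prev - phi_ii * v_prev   # (5.1.19)
        v_new[i + 1:] = v_prev - phi_ii * u_prev   # (5.1.20)
        sigma2 = (1.0 - phi_ii ** 2) * d / (2.0 * (n - i))
        d = (1.0 - phi_ii ** 2) * d - v_new[i + 1] ** 2 - u_new[n] ** 2
        u, v = u_new, v_new
        # Durbin-Levinson update of the lower-order coefficients.
        phi = np.concatenate([phi - phi_ii * phi[::-1], [phi_ii]])
    return phi, sigma2


phi_hat, sigma2_hat = burg_ar([1.0, 2.0, 3.0], p=1)
print(phi_hat, sigma2_hat)
```

For $p = 1$ the estimate reduces to $2\sum_{t=2}^{n} x_t x_{t-1} \big/ \sum_{t=2}^{n}\left(x_t^2 + x_{t-1}^2\right)$, which gives a quick hand check of the code on a short series.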
[Figure 5-3: The sample ACF of the lake data in Example 5.1.4.]

The large-sample distribution of the estimated coefficients for the Burg estimators of the coefficients of an AR($p$) process is the same as for the Yule-Walker estimators, namely, $\mathrm{N}\left(\boldsymbol\phi, n^{-1}\sigma^2\Gamma_p^{-1}\right)$. Approximate large-sample confidence intervals for the coefficients can be found as in Section 5.1.1 by substituting estimated values for $\sigma^2$ and $\Gamma_p$.

Example 5.1.3 The Dow Jones Utilities Index

The fitting of AR models using Burg's algorithm in the program ITSM is completely analogous to the use of the Yule-Walker equations. Applying the same transformations as in Example 5.1.1 to the Dow Jones Utilities Index and selecting Burg instead of Yule-Walker in the Preliminary Estimation dialog box, we obtain the minimum AICC Burg model

$$X_t - 0.4371X_{t-1} = Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, 0.1423), \tag{5.1.21}$$

with AICC = 74.492. This is slightly different from the Yule-Walker AR(1) model fitted in Example 5.1.1, and it has a larger likelihood $L$, i.e., a smaller value of $-2\ln L$ (see Section 5.2). Although the two methods give estimators with the same large-sample distributions, for finite sample sizes the Burg model usually has smaller estimated white noise variance and larger Gaussian likelihood. From the ratio of the estimated coefficient to ($1.96 \times$ standard error) displayed by ITSM, we obtain the 95% confidence bounds for $\phi$: $0.4371 \pm 0.4371/2.1668 = (0.2354, 0.6388)$.

□


Example 5.1.4 The Lake Data

This series $\{Y_t, t = 1, \ldots, 98\}$ has already been studied in Example 1.3.5. In this example we shall consider the problem of fitting an AR process directly to the data without first removing any trend component. A graph of the data was displayed in Figure 1-9. The sample ACF and PACF are shown in Figures 5-3 and 5-4, respectively.

The sample PACF shown in Figure 5-4 strongly suggests fitting an AR(2) model to the mean-corrected data $X_t = Y_t - 9.0041$. After clicking on the blue preliminary estimation button of ITSM select Yes to subtract the sample mean from $\{Y_t\}$. Then specify 2 for the AR order, 0 for the MA order, and Burg for estimation. Click OK to obtain the model

$$X_t - 1.0449X_{t-1} + 0.2456X_{t-2} = Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, 0.4706),$$

with AICC value 213.55 and 95% confidence bounds

$$\phi_1: 1.0449 \pm 1.0449/5.5295 = (0.8559, 1.2339),$$

$$\phi_2: -0.2456 \pm 0.2456/1.2997 = (-0.4346, -0.0566).$$

[Figure 5-4: The sample PACF of the lake data in Example 5.1.4. Horizontal axis: lag, 0 to 40.]

Selecting the Yule-Walker method for estimation, we obtain the model

$$X_t - 1.0538X_{t-1} + 0.2668X_{t-2} = Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, 0.4920),$$

with AICC value 213.57 and 95% confidence bounds

$$\phi_1: 1.0538 \pm 1.0538/5.5227 = (0.8630, 1.2446),$$

$$\phi_2: -0.2668 \pm 0.2668/1.3980 = (-0.4576, -0.0760).$$

We notice, as in Example 5.1.3, that the Burg model again has smaller white noise variance and larger Gaussian likelihood than the Yule-Walker model.

If  we  determine  the  minimum  AICC  Yule-Walker  and  Burg  models,  we  find  that 
they  are  both  of  order  2.  Thus  the  order  suggested  by  the  sample  PACF  coincides  again 
with  the  order  obtained  by  AICC  minimization. 

□ 


5.1.3 The Innovations Algorithm


Just as we can fit autoregressive models of orders 1, 2, ... to the data $\{x_1, \ldots, x_n\}$ by applying the Durbin-Levinson algorithm to the sample autocovariances, we can also fit moving average models

$$X_t = Z_t + \hat\theta_{m1}Z_{t-1} + \cdots + \hat\theta_{mm}Z_{t-m}, \quad \{Z_t\} \sim \mathrm{WN}(0, \hat v_m) \tag{5.1.22}$$

of orders $m = 1, 2, \ldots$ by means of the innovations algorithm (Section 2.5.4). The estimated coefficient vectors $\hat{\boldsymbol\theta}_m := (\hat\theta_{m1}, \ldots, \hat\theta_{mm})'$ and white noise variances $\hat v_m$, $m = 1, 2, \ldots$, are specified in the following definition. (The justification for using estimators defined in this way is contained in Remark 1 following the definition.)


Definition  5.1.2 


The fitted innovations MA($m$) model is

$$X_t = Z_t + \hat\theta_{m1}Z_{t-1} + \cdots + \hat\theta_{mm}Z_{t-m}, \quad \{Z_t\} \sim \mathrm{WN}(0, \hat v_m),$$

where $\hat{\boldsymbol\theta}_m$ and $\hat v_m$ are obtained from the innovations algorithm with the ACVF replaced by the sample ACVF.


Remark 1. It can be shown (see Brockwell and Davis 1988) that if $\{X_t\}$ is an invertible MA($q$) process

$$X_t = Z_t + \theta_1Z_{t-1} + \cdots + \theta_qZ_{t-q}, \quad \{Z_t\} \sim \mathrm{IID}\left(0, \sigma^2\right),$$

with $EZ_t^4 < \infty$, and if we define $\theta_0 = 1$ and $\theta_j = 0$ for $j > q$, then the innovation estimates have the following large-sample properties. If $n \to \infty$ and $m(n)$ is any sequence of positive integers such that $m(n) \to \infty$ but $n^{-1/3}m(n) \to 0$, then for each positive integer $k$ the joint distribution function of

$$n^{1/2}\left(\hat\theta_{m1} - \theta_1, \hat\theta_{m2} - \theta_2, \ldots, \hat\theta_{mk} - \theta_k\right)'$$

converges to that of the multivariate normal distribution with mean $\mathbf{0}$ and covariance matrix $A = [a_{ij}]_{i,j=1}^{k}$, where

$$a_{ij} = \sum_{r=1}^{\min(i,j)} \theta_{i-r}\theta_{j-r}. \tag{5.1.23}$$

This result enables us to find approximate large-sample confidence intervals for the moving-average coefficients from the innovation estimates as described in the examples below. Moreover, the estimator $\hat v_m$ is consistent for $\sigma^2$ in the sense that for every $\epsilon > 0$, $P\left(\left|\hat v_m - \sigma^2\right| > \epsilon\right) \to 0$ as $m \to \infty$. □


Remark 2. Although the recursive fitting of moving-average models using the innovations algorithm is closely analogous to the recursive fitting of autoregressive models using the Durbin-Levinson algorithm, there is one important distinction. For an AR($p$) process the Yule-Walker and Burg estimators $\hat{\boldsymbol\phi}_p$ are consistent estimators of $(\phi_1, \ldots, \phi_p)'$ as the sample size $n \to \infty$. However, for an MA($q$) process the estimator $\hat{\boldsymbol\theta}_q = (\hat\theta_{q1}, \ldots, \hat\theta_{qq})'$ is not consistent for $(\theta_1, \ldots, \theta_q)'$. For consistency it is necessary to use the estimators $(\hat\theta_{m1}, \ldots, \hat\theta_{mq})'$ with $m(n)$ satisfying the conditions of Remark 1. The choice of $m$ for any fixed sample size can be made by increasing $m$ until the vector $(\hat\theta_{m1}, \ldots, \hat\theta_{mq})'$ stabilizes. It is found in practice that there is a large range of values of $m$ for which the fluctuations in $\hat\theta_{mj}$ are small compared with the estimated asymptotic standard deviation $n^{-1/2}\left(\sum_{i=0}^{j-1}\hat\theta_{mi}^2\right)^{1/2}$ as found from (5.1.23) when the coefficients $\theta_j$ are replaced by their estimated values $\hat\theta_{mj}$. □


Order  Selection 

Three  useful  techniques  for  selecting  an  appropriate  MA  model  are  given  below.  The 
third  is  more  systematic  and  extends  beyond  the  narrow  class  of  pure  moving-average 
models.

• We know from Section 3.2.2 that for an MA($q$) process the autocorrelations $\rho(m)$, $m > q$, are zero. Moreover, we know from Bartlett's formula (Section 2.4) that the sample autocorrelation $\hat\rho(m)$, $m > q$, is approximately normally distributed with mean $\rho(m) = 0$ and variance $n^{-1}\left[1 + 2\rho^2(1) + \cdots + 2\rho^2(q)\right]$. This result enables us to use the graph of $\hat\rho(m)$, $m = 1, 2, \ldots$, both to decide whether or not a given data set can be plausibly modeled by a moving-average process and also to obtain a preliminary estimate of the order $q$ as the smallest value of $m$ such that $\hat\rho(k)$ is not significantly different from zero for all $k > m$. For practical purposes "significantly different from zero" is often interpreted as "larger than $1.96/\sqrt{n}$ in absolute value" (cf. the corresponding approach to order selection for AR models based on the sample PACF and described in Section 5.1.1).

• If in addition to examining $\hat\rho(m)$, $m = 1, 2, \ldots$, we examine the coefficient vectors $\hat{\boldsymbol\theta}_m$, $m = 1, 2, \ldots$, we are able not only to assess the appropriateness of a moving-average model and estimate its order $q$, but at the same time to obtain preliminary estimates $\hat\theta_{m1}, \ldots, \hat\theta_{mq}$ of the coefficients. By inspecting the estimated coefficients $\hat\theta_{m1}, \ldots, \hat\theta_{mm}$ for $m = 1, 2, \ldots$ and the ratio of each coefficient estimate $\hat\theta_{mj}$ to 1.96 times its approximate standard deviation $\sigma_j = n^{-1/2}\left(\sum_{i=0}^{j-1}\hat\theta_{mi}^2\right)^{1/2}$, we can see which of the coefficient estimates are most significantly different from zero, estimate the order of the model to be fitted as the largest lag $j$ for which the ratio is larger than 1 in absolute value, and at the same time read off estimated values for each of the coefficients. A default value of $m$ is set by the program, but it may be altered manually. As $m$ is increased the values $\hat\theta_{m1}, \ldots, \hat\theta_{mm}$ stabilize in the sense that the fluctuations in each component are of order $n^{-1/2}$, the asymptotic standard deviation of $\hat\theta_{m1}$.

• As for autoregressive models, a more systematic approach to order selection for moving-average models is to find the values of $q$ and $\hat{\boldsymbol\theta}_q = (\hat\theta_{m1}, \ldots, \hat\theta_{mq})'$ that minimize the AICC statistic

$$\mathrm{AICC} = -2\ln L\left(\boldsymbol\theta_q, S(\boldsymbol\theta_q)/n\right) + 2(q + 1)n/(n - q - 2),$$

where $L$ is the Gaussian likelihood defined in (5.2.9) and $S$ is defined in (5.2.11). (See Section 5.5 for further details.)


Confidence Regions for the Coefficients

Asymptotic confidence regions for the coefficient vector $\boldsymbol\theta_q$ and for its individual components can be found with the aid of the large-sample distribution specified in Remark 1. For example, approximate 95% confidence bounds for $\theta_j$ are given by

$$\hat\theta_{mj} \pm 1.96\,n^{-1/2}\left(\sum_{i=0}^{j-1}\hat\theta_{mi}^2\right)^{1/2}. \tag{5.1.24}$$


Example 5.1.5 The Dow Jones Utilities Index


In Example 5.1.1 we fitted an AR(1) model to the differenced Dow Jones Utilities Index. The sample ACF of the differenced data shown in Figure 5-1 suggests that an MA(2) model might also provide a good fit to the data. To apply the innovation technique for preliminary estimation, we proceed as in Example 5.1.1 to difference the series DOWJ.TSM to obtain observations of the differenced series $\{Y_t\}$. We then select preliminary estimation by clicking on the blue PRE button and subtract the mean of the differences to obtain observations of the differenced and mean-corrected series $\{X_t\}$. In the Preliminary Estimation dialog box enter 0 for the AR order and 2 for the MA order, and select Innovations as the estimation method. We must then specify a value of $m$, which is set by default in this case to 17. If we accept the default value, the program will compute $\hat\theta_{17,1}, \ldots, \hat\theta_{17,17}$ and print out the first two values as the estimates of $\theta_1$ and $\theta_2$, together with the ratios of the estimated values to 1.96 times their estimated standard deviations. These are

MA COEFFICIENT
 0.4269  0.2704
COEFFICIENT/(1.96*STANDARD ERROR)
 1.9114  1.1133

The  remaining  parameter  in  the  model  is  the  white  noise  variance,  for  which  two 
estimates  are  given: 

WN  VARIANCE  ESTIMATE  =  (RESID  SS)/N 

0.1470 

INNOVATION  WN  VARIANCE  ESTIMATE 

0.1122 

The first of these is the average of the squares of the rescaled one-step prediction errors under the fitted MA(2) model, i.e., $\frac{1}{77}\sum_{j=1}^{77}\left(X_j - \hat X_j\right)^2/r_{j-1}$. The second value is the innovation estimate, $\hat v_{17}$. (By default ITSM retains the first value. If you wish instead to use the innovation estimate, you must change the white noise variance by selecting Model>Specify and setting the white noise value to the desired value.) The fitted model for $X_t$ ($= Y_t - 0.1336$) is thus

$$X_t = Z_t + 0.4269Z_{t-1} + 0.2704Z_{t-2}, \quad \{Z_t\} \sim \mathrm{WN}(0, 0.1470),$$

with AICC = 77.467.

To see all 17 estimated coefficients $\hat\theta_{17,j}$, $j = 1, \ldots, 17$, we repeat the preliminary estimation, this time fitting an MA(17) model with $m = 17$. The coefficients and ratios for the resulting model are found to be as follows:

MA COEFFICIENT
 0.4269  0.2704  0.1183  0.1589  0.1355  0.1568  0.1284 -0.0060
 0.0148 -0.0017  0.1974 -0.0463  0.2023  0.1285 -0.0213 -0.2575
 0.0760
COEFFICIENT/(1.96*STANDARD ERROR)
 1.9114  1.1133  0.4727  0.6314  0.5331  0.6127  0.4969 -0.0231
 0.0568 -0.0064  0.7594 -0.1757  0.7667  0.4801 -0.0792 -0.9563
 0.2760


The ratios indicate that the estimated coefficients most significantly different from zero are the first and second, reinforcing our original intention of fitting an MA(2) model to the data. Estimated coefficients $\hat\theta_{mj}$ for other values of $m$ can be examined in the same way, and it is found that the values obtained for $m > 17$ change only slightly from the values tabulated above.

By fitting MA($q$) models of orders 0, 1, 2, ..., 26 using the innovations algorithm
with  the  default  settings  for  m,  we  find  that  the  minimum  AICC  model  is  the  one  with 
q  =  2  found  above.  Thus  the  model  suggested  by  the  sample  ACF  again  coincides 
with  the  more  systematically  chosen  minimum  AICC  model. 

□ 
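
The ratios printed by ITSM can be approximately reproduced from the innovation estimates and the standard deviations $\sigma_j = n^{-1/2}\left(\sum_{i=0}^{j-1}\hat\theta_{mi}^2\right)^{1/2}$ implied by (5.1.23). The Python sketch below does this for the first two coefficients of Example 5.1.5. The function name `innovation_ratios` is hypothetical, and the sample size $n = 77$ is an assumption taken from the $\sqrt{77}$ appearing in the confidence bounds of Example 5.1.1.

```python
import math


def innovation_ratios(theta_hat, n):
    """Ratios theta_hat_mj / (1.96 * sigma_j) for j = 1, 2, ..., where
    sigma_j = n**-0.5 * sqrt(sum_{i=0}^{j-1} theta_hat_mi**2), theta_hat_m0 = 1."""
    ratios = []
    cumulative = 1.0  # theta_hat_m0 ** 2
    for coef in theta_hat:
        sigma_j = math.sqrt(cumulative / n)
        ratios.append(coef / (1.96 * sigma_j))
        cumulative += coef ** 2
    return ratios


# First two innovation estimates from Example 5.1.5 (m = 17), assuming n = 77.
print([round(r, 3) for r in innovation_ratios([0.4269, 0.2704], n=77)])
```

The two values agree with the printed ratios 1.9114 and 1.1133 to about three decimal places; the small discrepancy presumably reflects rounding of the displayed coefficients.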
Innovations Algorithm Estimates when p > 0 and q > 0

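The causality assumption (Section 3.1) ensures that

$$X_t = \sum_{j=0}^{\infty} \psi_jZ_{t-j},$$

where the coefficients $\psi_j$ satisfy

$$\psi_j = \theta_j + \sum_{i=1}^{\min(j,p)} \phi_i\psi_{j-i}, \quad j = 0, 1, \ldots, \tag{5.1.25}$$

and we define $\theta_0 := 1$ and $\theta_j := 0$ for $j > q$. To estimate $\psi_1, \ldots, \psi_{p+q}$ we can use the innovation estimates $\hat\theta_{m1}, \ldots, \hat\theta_{m,p+q}$, whose large-sample behavior is specified in Remark 1. Replacing $\psi_j$ by $\hat\theta_{mj}$ in (5.1.25) and solving the resulting equations

$$\hat\theta_{mj} = \theta_j + \sum_{i=1}^{\min(j,p)} \phi_i\hat\theta_{m,j-i}, \quad j = 1, \ldots, p + q, \tag{5.1.26}$$

for $\boldsymbol\phi$ and $\boldsymbol\theta$, we obtain initial parameter estimates $\hat{\boldsymbol\phi}$ and $\hat{\boldsymbol\theta}$. To solve (5.1.26) we first find $\boldsymbol\phi$ from the last $p$ equations (those with $j = q + 1, \ldots, q + p$, for which $\theta_j = 0$):

$$\begin{bmatrix} \hat\theta_{m,q+1} \\ \hat\theta_{m,q+2} \\ \vdots \\ \hat\theta_{m,q+p} \end{bmatrix} = \begin{bmatrix} \hat\theta_{mq} & \hat\theta_{m,q-1} & \cdots & \hat\theta_{m,q+1-p} \\ \hat\theta_{m,q+1} & \hat\theta_{mq} & \cdots & \hat\theta_{m,q+2-p} \\ \vdots & \vdots & & \vdots \\ \hat\theta_{m,q+p-1} & \hat\theta_{m,q+p-2} & \cdots & \hat\theta_{mq} \end{bmatrix} \begin{bmatrix} \phi_1 \\ \phi_2 \\ \vdots \\ \phi_p \end{bmatrix}. \tag{5.1.27}$$

Having solved (5.1.27) for $\hat{\boldsymbol\phi}$ (which may not be causal), we can easily determine the estimate of $\boldsymbol\theta$ from

$$\hat\theta_j = \hat\theta_{mj} - \sum_{i=1}^{\min(j,p)} \hat\phi_i\hat\theta_{m,j-i}, \quad j = 1, \ldots, q.$$

Finally, the white noise variance $\sigma^2$ is estimated by

$$\hat\sigma^2 = n^{-1}\sum_{t=1}^{n}\left(X_t - \hat X_t\right)^2 / r_{t-1},$$

where $\hat X_t$ is the one-step predictor of $X_t$ computed from the fitted coefficient vectors $\hat{\boldsymbol\phi}$ and $\hat{\boldsymbol\theta}$, and $r_{t-1}$ is defined in (3.3.8).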
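
The two-stage solution of (5.1.26) can be sketched in Python as follows. This is an illustrative sketch only: the function name `arma_from_innovations` is hypothetical, NumPy is assumed to be available, and the conventions $\hat\theta_{m0} = 1$ and $\hat\theta_{mj} = 0$ for $j < 0$ are used when filling the matrix in (5.1.27). The white noise variance step is omitted because it requires the one-step predictors.

```python
import numpy as np


def arma_from_innovations(theta_m, p, q):
    """Solve (5.1.26) for (phi_hat, theta_hat) given the innovation estimates
    theta_m = [theta_m1, ..., theta_m,p+q] (at least p + q values)."""
    theta_m = np.asarray(theta_m, dtype=float)

    def th(j):
        # theta_m0 = 1 and theta_mj = 0 for j < 0.
        if j < 0:
            return 0.0
        if j == 0:
            return 1.0
        return theta_m[j - 1]

    if p > 0:
        # Equations j = q+1, ..., q+p of (5.1.26), i.e. the system (5.1.27).
        A = np.array([[th(q + r - c) for c in range(p)] for r in range(p)])
        b = np.array([th(q + 1 + r) for r in range(p)])
        phi = np.linalg.solve(A, b)
    else:
        phi = np.zeros(0)

    # theta_j = theta_mj - sum_{i=1}^{min(j,p)} phi_i * theta_m,j-i, j = 1..q.
    theta = np.array([
        th(j) - sum(phi[i - 1] * th(j - i) for i in range(1, min(j, p) + 1))
        for j in range(1, q + 1)
    ])
    return phi, theta


# Exact psi-weights of an ARMA(1,1) with phi = 0.5, theta = 0.4: 0.9, 0.45.
print(arma_from_innovations([0.9, 0.45], p=1, q=1))
```

When the exact $\psi_j$ of an ARMA model are supplied in place of the estimates, the function recovers the true coefficients, which is a convenient check of the indexing.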

The above calculations can all be carried out by selecting the ITSM option Model>Estimation>Preliminary. This option also computes, if $p = q$, the ratio of each estimated coefficient to 1.96 times its estimated standard deviation. Approximate 95% confidence intervals can therefore easily be obtained in this case. If the fitted model is noncausal, it cannot be used to initialize the search for the maximum likelihood estimators, and so the autoregressive coefficients should be set to some causal values (e.g., all equal to 0.001) using the Model>Specify option. If both the innovation and Hannan-Rissanen algorithms give noncausal models, it is an indication (but not a conclusive one) that the assumed values of $p$ and $q$ may not be appropriate for the data.

Order Selection for Mixed Models

For models with $p > 0$ and $q > 0$, the sample ACF and PACF are difficult to recognize and are of far less value in order selection than in the special cases where $p = 0$ or $q = 0$. A systematic approach, however, is still available through minimization of the AICC statistic

$$\mathrm{AICC} = -2\ln L\left(\boldsymbol\phi_p, \boldsymbol\theta_q, S(\boldsymbol\phi_p, \boldsymbol\theta_q)/n\right) + 2(p + q + 1)n/(n - p - q - 2),$$

which  is  discussed  in  more  detail  in  Section  5.5.  For  fixed  p  and  q  it  is  clear  from  the 
definition  that  the  AICC  value  is  minimized  by  the  parameter  values  that  maximize  the 
likelihood.  Hence,  final  decisions  regarding  the  orders  p  and  q  that  minimize  AICC 
must  be  based  on  maximum  likelihood  estimation  as  described  in  Section  5.2. 

Example 5.1.6 The Lake Data

In Example 5.1.4 we fitted AR(2) models to the mean-corrected lake data using the Yule-Walker equations and Burg's algorithm. If instead we fit an ARMA(1,1) model using the innovations method in the option Model>Estimation>Preliminary of ITSM (with the default value $m = 17$), we obtain the model

$$X_t - 0.7234X_{t-1} = Z_t + 0.3596Z_{t-1}, \quad \{Z_t\} \sim \mathrm{WN}(0, 0.4757),$$

for the mean-corrected series $X_t = Y_t - 9.0041$. The ratios of the two coefficient estimates $\hat\phi$ and $\hat\theta$ to 1.96 times their estimated standard deviations are given by ITSM as 3.2064 and 1.8513, respectively. The corresponding 95% confidence intervals are therefore

$$\phi: 0.7234 \pm 0.7234/3.2064 = (0.4978, 0.9490),$$

$$\theta: 0.3596 \pm 0.3596/1.8513 = (0.1654, 0.5538).$$

It is interesting to note that the value of AICC for this model is 212.89, which is smaller than the corresponding values for the Burg and Yule-Walker AR(2) models in Example 5.1.4. This suggests that an ARMA(1,1) model may be superior to a pure autoregressive model for these data. Preliminary estimation of a variety of ARMA($p$, $q$) models shows that the minimum AICC value does in fact occur when $p = q = 1$. (Before committing ourselves to this model, however, we need to compare AICC values for the corresponding maximum likelihood models. We shall do this in Section 5.2.)

□ 


5.1.4  The  Hannan-Rissanen  Algorithm 

The defining equations for a causal AR($p$) model have the form of a linear regression model with coefficient vector $\boldsymbol\phi = (\phi_1, \ldots, \phi_p)'$. This suggests the use of simple least squares regression for obtaining preliminary parameter estimates when $q = 0$. Application of this technique when $q > 0$ is complicated by the fact that in the general ARMA($p$, $q$) equations $X_t$ is regressed not only on $X_{t-1}, \ldots, X_{t-p}$, but also on the unobserved quantities $Z_{t-1}, \ldots, Z_{t-q}$. Nevertheless, it is still possible to apply least squares regression to the estimation of $\boldsymbol\phi$ and $\boldsymbol\theta$ by first replacing the unobserved quantities $Z_{t-1}, \ldots, Z_{t-q}$ in (5.1.1) by estimated values $\hat Z_{t-1}, \ldots, \hat Z_{t-q}$. The parameters $\boldsymbol\phi$ and $\boldsymbol\theta$ are then estimated by regressing $X_t$ onto $X_{t-1}, \ldots, X_{t-p}, \hat Z_{t-1}, \ldots, \hat Z_{t-q}$. These
are  the  main  steps  in  the  Hannan-Rissanen  estimation  procedure,  which  we  now 
describe in more detail.

Step 1. A high-order AR($m$) model (with $m > \max(p, q)$) is fitted to the data using the Yule-Walker estimates of Section 5.1.1. If $(\hat\phi_{m1}, \ldots, \hat\phi_{mm})'$ is the vector of estimated coefficients, then the estimated residuals are computed from the equations

$$\hat Z_t = X_t - \hat\phi_{m1}X_{t-1} - \cdots - \hat\phi_{mm}X_{t-m}, \quad t = m + 1, \ldots, n.$$

Step 2. Once the estimated residuals $\hat Z_t$, $t = m + 1, \ldots, n$, have been computed as in Step 1, the vector of parameters, $\boldsymbol\beta = (\boldsymbol\phi', \boldsymbol\theta')'$ is estimated by least squares linear regression of $X_t$ onto $(X_{t-1}, \ldots, X_{t-p}, \hat Z_{t-1}, \ldots, \hat Z_{t-q})$, $t = m + 1 + q, \ldots, n$, i.e., by minimizing the sum of squares

$$S(\boldsymbol\beta) = \sum_{t=m+1+q}^{n} \left(X_t - \phi_1X_{t-1} - \cdots - \phi_pX_{t-p} - \theta_1\hat Z_{t-1} - \cdots - \theta_q\hat Z_{t-q}\right)^2$$

with respect to $\boldsymbol\beta$. This gives the Hannan-Rissanen estimator

$$\hat{\boldsymbol\beta} = (Z'Z)^{-1}Z'\mathbf{X}_n,$$

where $\mathbf{X}_n = (X_{m+1+q}, \ldots, X_n)'$ and $Z$ is the $(n - m - q) \times (p + q)$ matrix

$$Z = \begin{bmatrix} X_{m+q} & X_{m+q-1} & \cdots & X_{m+q+1-p} & \hat Z_{m+q} & \hat Z_{m+q-1} & \cdots & \hat Z_{m+1} \\ X_{m+q+1} & X_{m+q} & \cdots & X_{m+q+2-p} & \hat Z_{m+q+1} & \hat Z_{m+q} & \cdots & \hat Z_{m+2} \\ \vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\ X_{n-1} & X_{n-2} & \cdots & X_{n-p} & \hat Z_{n-1} & \hat Z_{n-2} & \cdots & \hat Z_{n-q} \end{bmatrix}.$$

(If $p = 0$, $Z$ contains only the last $q$ columns.) The Hannan-Rissanen estimate of the white noise variance is

$$\hat\sigma^2_{\mathrm{HR}} = \frac{S(\hat{\boldsymbol\beta})}{n - m - q}.$$
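
Steps 1 and 2 can be sketched in Python as below. This is a minimal illustration under stated assumptions rather than the ITSM implementation: the function names are hypothetical, NumPy is assumed to be available, the data are mean-corrected inside the function, the sample ACVF uses the divisor $n$, and the choice of $m$ is left to the caller.

```python
import numpy as np


def sample_acvf(x, max_lag):
    """Sample autocovariances at lags 0..max_lag (divisor n)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    return np.array([np.dot(xc[:n - h], xc[h:]) / n for h in range(max_lag + 1)])


def yule_walker(x, m):
    """Yule-Walker AR(m) coefficients from the sample ACVF."""
    g = sample_acvf(x, m)
    gamma_m = np.array([[g[abs(i - j)] for j in range(m)] for i in range(m)])
    return np.linalg.solve(gamma_m, g[1:])


def hannan_rissanen(x, p, q, m):
    """Steps 1 and 2 of Hannan-Rissanen: returns (phi_hat, theta_hat, sigma2_hat)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    # Step 1: residuals from a high-order AR(m) fit, for t = m+1..n (1-based).
    phi_m = yule_walker(x, m)
    z = np.zeros(n)
    for t in range(m, n):                      # 0-based index of X_{t+1}
        z[t] = x[t] - np.dot(phi_m, x[t - m:t][::-1])
    # Step 2: regress X_t on lagged X's and lagged residuals, t = m+1+q..n.
    rows = range(m + q, n)
    design = np.array([[x[t - i] for i in range(1, p + 1)] +
                       [z[t - k] for k in range(1, q + 1)] for t in rows])
    y = x[m + q:]
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    sigma2 = resid @ resid / (n - m - q)
    return beta[:p], beta[p:], sigma2
```

A typical call would be `hannan_rissanen(x, p=1, q=1, m=20)`; in practice several values of $m$ can be tried to check that the estimates are stable.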


Example 5.1.7 The Lake Data


In Example 5.1.6 an ARMA(1,1) model was fitted to the mean-corrected lake data using the innovations algorithm. We can fit an ARMA(1,1) model to these data using the Hannan-Rissanen estimates by selecting Hannan-Rissanen in the Preliminary Estimation dialog box of ITSM. The fitted model is

$$X_t - 0.6961X_{t-1} = Z_t + 0.3788Z_{t-1}, \quad \{Z_t\} \sim \mathrm{WN}(0, 0.4774),$$

for the mean-corrected series $X_t = Y_t - 9.0041$. (Two estimates of the white noise variance are computed in ITSM for the Hannan-Rissanen procedure, $\hat\sigma^2_{\mathrm{HR}}$ and $\sum_{t=1}^{n}\left(X_t - \hat X_t\right)^2/n$. The latter is the one retained by the program.) The ratios of the two coefficient estimates to 1.96 times their standard deviation are 4.5289 and 1.3120, respectively. The corresponding 95% confidence bounds for $\phi$ and $\theta$ are

$$\phi: 0.6961 \pm 0.6961/4.5289 = (0.5424, 0.8498),$$

$$\theta: 0.3788 \pm 0.3788/1.3120 = (0.0901, 0.6675).$$

Clearly, there is little difference between this model and the one fitted using the innovations method in Example 5.1.6. (The AICC values are 213.18 for the current model and 212.89 for the model fitted in Example 5.1.6.)

□ 

Hannan  and  Rissanen  include  a  third  step  in  their  procedure  to  improve  the 
estimates. 



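Step 3. Using the estimate $\hat{\boldsymbol\beta} = (\hat\phi_1, \ldots, \hat\phi_p, \hat\theta_1, \ldots, \hat\theta_q)'$ from Step 2, set

$$\tilde Z_t = \begin{cases} 0, & \text{if } t \le \max(p, q), \\ X_t - \sum_{j=1}^{p}\hat\phi_jX_{t-j} - \sum_{j=1}^{q}\hat\theta_j\tilde Z_{t-j}, & \text{if } t > \max(p, q). \end{cases}$$

Now for $t = 1, \ldots, n$ put

$$V_t = \begin{cases} 0, & \text{if } t \le \max(p, q), \\ \sum_{j=1}^{p}\hat\phi_jV_{t-j} + \tilde Z_t, & \text{if } t > \max(p, q), \end{cases}$$

and

$$W_t = \begin{cases} 0, & \text{if } t \le \max(p, q), \\ -\sum_{j=1}^{q}\hat\theta_jW_{t-j} + \tilde Z_t, & \text{if } t > \max(p, q). \end{cases}$$

(Observe that both $V_t$ and $W_t$ satisfy the AR recursions $\hat\phi(B)V_t = \tilde Z_t$ and $\hat\theta(B)W_t = \tilde Z_t$ for $t = 1, \ldots, n$.) If $\hat{\boldsymbol\beta}^\dagger$ is the regression estimate of $\boldsymbol\beta$ found by regressing $\tilde Z_t$ on $(V_{t-1}, \ldots, V_{t-p}, W_{t-1}, \ldots, W_{t-q})$, i.e., if $\hat{\boldsymbol\beta}^\dagger$ minimizes

$$S^\dagger(\boldsymbol\beta) = \sum_{t=\max(p,q)+1}^{n} \left(\tilde Z_t - \sum_{j=1}^{p}\beta_jV_{t-j} - \sum_{k=1}^{q}\beta_{k+p}W_{t-k}\right)^2,$$

then the improved estimate of $\boldsymbol\beta$ is $\tilde{\boldsymbol\beta} = \hat{\boldsymbol\beta}^\dagger + \hat{\boldsymbol\beta}$. The new estimator $\tilde{\boldsymbol\beta}$ then has the same asymptotic efficiency as the maximum likelihood estimator. In ITSM, however, we eliminate Step 3, using the model produced by Step 2 as the initial model for the calculation (by numerical maximization) of the maximum likelihood estimator itself.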


5.2  Maximum  Likelihood  Estimation 

Suppose that $\{X_t\}$ is a Gaussian time series with mean zero and autocovariance function $\kappa(i, j) = E(X_i X_j)$. Let $\mathbf{X}_n = (X_1, \ldots, X_n)'$ and let $\hat{\mathbf{X}}_n = (\hat X_1, \ldots, \hat X_n)'$, where $\hat X_1 = 0$ and $\hat X_j = E(X_j \mid X_1, \ldots, X_{j-1}) = P_{j-1}X_j$, $j \ge 2$. Let $\Gamma_n$ denote the covariance matrix $\Gamma_n = E(\mathbf{X}_n \mathbf{X}_n')$, and assume that $\Gamma_n$ is nonsingular.

The likelihood of $\mathbf{X}_n$ is

$$L(\Gamma_n) = (2\pi)^{-n/2} (\det \Gamma_n)^{-1/2} \exp\left( -\tfrac{1}{2} \mathbf{X}_n' \Gamma_n^{-1} \mathbf{X}_n \right). \tag{5.2.1}$$

As we shall now show, the direct calculation of $\det \Gamma_n$ and $\Gamma_n^{-1}$ can be avoided by expressing this in terms of the one-step prediction errors $X_j - \hat X_j$ and their variances $v_{j-1}$, $j = 1, \ldots, n$, both of which are easily calculated recursively from the innovations algorithm (Section 2.5.4).

Let $\theta_{ij}$, $j = 1, \ldots, i$; $i = 1, 2, \ldots$, denote the coefficients obtained when the innovations algorithm is applied to the autocovariance function $\kappa$ of $\{X_t\}$, and let $C_n$ be the $n \times n$ lower triangular matrix defined in Section 2.5.4. From (2.5.27) we have the identity

$$\mathbf{X}_n = C_n \left( \mathbf{X}_n - \hat{\mathbf{X}}_n \right). \tag{5.2.2}$$


We also know from Remark 5 of Section 2.5.4 that the components of $\mathbf{X}_n - \hat{\mathbf{X}}_n$ are uncorrelated. Consequently, by the definition of $v_j$, $\mathbf{X}_n - \hat{\mathbf{X}}_n$ has the diagonal covariance matrix

$$D_n = \mathrm{diag}\{v_0, \ldots, v_{n-1}\}.$$


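From (5.2.2) and (A.2.5) we conclude that

$$\Gamma_n = C_n D_n C_n'. \tag{5.2.3}$$

From (5.2.2) and (5.2.3) we see that

$$\mathbf{X}_n' \Gamma_n^{-1} \mathbf{X}_n = \left( \mathbf{X}_n - \hat{\mathbf{X}}_n \right)' D_n^{-1} \left( \mathbf{X}_n - \hat{\mathbf{X}}_n \right) = \sum_{j=1}^{n} \left( X_j - \hat X_j \right)^2 / v_{j-1} \tag{5.2.4}$$

and

$$\det \Gamma_n = (\det C_n)^2 (\det D_n) = v_0 v_1 \cdots v_{n-1}. \tag{5.2.5}$$

The likelihood (5.2.1) of the vector $\mathbf{X}_n$ therefore reduces to

$$L(\Gamma_n) = \frac{1}{\sqrt{(2\pi)^n v_0 \cdots v_{n-1}}} \exp\left\{ -\frac{1}{2} \sum_{j=1}^{n} \left( X_j - \hat X_j \right)^2 / v_{j-1} \right\}. \tag{5.2.6}$$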
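
To make the innovations form concrete, the following Python sketch applies the innovations recursions to a user-supplied autocovariance function and evaluates the logarithm of (5.2.6). It is a minimal illustration under stated assumptions, not the ITSM implementation: the function names are hypothetical, `kappa(i, j)` takes 1-based indices as in the text, and the lists `x`, `xhat`, and `v` are 0-based (so `v[j]` holds $v_j$ and `xhat[j]` holds $\hat X_{j+1}$).

```python
import math


def innovations(kappa, n):
    """Innovations algorithm: returns theta[m][j] (1 <= j <= m < n) and v[0..n-1]."""
    v = [kappa(1, 1)]
    theta = [[0.0] * (m + 1) for m in range(n)]  # index 0 of each row is unused
    for m in range(1, n):
        for k in range(m):
            s = sum(theta[k][k - j] * theta[m][m - j] * v[j] for j in range(k))
            theta[m][m - k] = (kappa(m + 1, k + 1) - s) / v[k]
        v.append(kappa(m + 1, m + 1)
                 - sum(theta[m][m - j] ** 2 * v[j] for j in range(m)))
    return theta, v


def one_step_predictors(x, theta):
    """xhat[m] is the best linear predictor of x[m] given x[0..m-1]; xhat[0] = 0."""
    n = len(x)
    xhat = [0.0] * n
    for m in range(1, n):
        xhat[m] = sum(theta[m][j] * (x[m - j] - xhat[m - j])
                      for j in range(1, m + 1))
    return xhat


def gaussian_loglik(x, kappa):
    """Log of the innovations form (5.2.6) of the Gaussian likelihood."""
    n = len(x)
    theta, v = innovations(kappa, n)
    xhat = one_step_predictors(x, theta)
    quad = sum((x[j] - xhat[j]) ** 2 / v[j] for j in range(n))   # as in (5.2.4)
    log_det = sum(math.log(vj) for vj in v)                      # as in (5.2.5)
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + quad)
```

For short series the result can be checked against a direct evaluation of (5.2.1); the benefit of the innovations form is that no matrix inverse or determinant is ever formed.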


If $\Gamma_n$ is expressible in terms of a finite number of unknown parameters $\beta_1, \ldots, \beta_r$ (as is the case when $\{X_t\}$ is an ARMA(p, q) process), the maximum likelihood estimators of the parameters are those values that maximize $L$ for the given data set. When $X_1, X_2, \ldots, X_n$ are iid, it is known, under mild assumptions and for $n$ large, that maximum likelihood estimators are approximately normally distributed with variances that are at least as small as those of other asymptotically normally distributed estimators (see, e.g., Lehmann 1983).

Even if $\{X_t\}$ is not Gaussian, it still makes sense to regard (5.2.6) as a measure of goodness of fit of the model to the data, and to choose the parameters $\beta_1, \ldots, \beta_r$ in such a way as to maximize (5.2.6). We shall always refer to the estimators $\hat\beta_1, \ldots, \hat\beta_r$ so obtained as “maximum likelihood” estimators, even when $\{X_t\}$ is not Gaussian. Regardless of the joint distribution of $X_1, \ldots, X_n$, we shall refer to (5.2.1) and its algebraic equivalent (5.2.6) as the “likelihood” (or “Gaussian likelihood”) of $X_1, \ldots, X_n$. A justification for using maximum Gaussian likelihood estimators of ARMA coefficients is that the large-sample distribution of the estimators is the same for $\{Z_t\} \sim \mathrm{IID}(0, \sigma^2)$, regardless of whether or not $\{Z_t\}$ is Gaussian (see Brockwell and Davis (1991), Section 10.8).

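The likelihood for data from an ARMA(p, q) process is easily computed from the innovations form of the likelihood (5.2.6) by evaluating the one-step predictors $\hat X_{i+1}$ and the corresponding mean squared errors $v_i$. These can be found from the recursions (Section 3.3)

$$\hat X_{n+1} = \begin{cases} \sum_{j=1}^{n} \theta_{nj} \left( X_{n+1-j} - \hat X_{n+1-j} \right), & 1 \le n < m, \\ \phi_1 X_n + \cdots + \phi_p X_{n+1-p} + \sum_{j=1}^{q} \theta_{nj} \left( X_{n+1-j} - \hat X_{n+1-j} \right), & n \ge m, \end{cases} \tag{5.2.7}$$

and

$$E\left( X_{n+1} - \hat X_{n+1} \right)^2 = \sigma^2 E\left( W_{n+1} - \hat W_{n+1} \right)^2 = \sigma^2 r_n, \tag{5.2.8}$$

where $\theta_{nj}$ and $r_n$ are determined by the innovations algorithm with $\kappa$ as in (3.3.3) and $m = \max(p, q)$. Substituting in the general expression (5.2.6), we obtain the following:

The Gaussian Likelihood for an ARMA Process:

$$L\left( \boldsymbol\phi, \boldsymbol\theta, \sigma^2 \right) = \frac{1}{\sqrt{\left( 2\pi\sigma^2 \right)^n r_0 \cdots r_{n-1}}} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{j=1}^{n} \frac{\left( X_j - \hat X_j \right)^2}{r_{j-1}} \right\}. \tag{5.2.9}$$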


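Differentiating $\ln L(\boldsymbol\phi, \boldsymbol\theta, \sigma^2)$ partially with respect to $\sigma^2$ and noting that $\hat X_j$ and $r_j$ are independent of $\sigma^2$, we find that the maximum likelihood estimators $\hat{\boldsymbol\phi}$, $\hat{\boldsymbol\theta}$, and $\hat\sigma^2$ satisfy the following equations (Problem 5.8):

Maximum Likelihood Estimators:

$$\hat\sigma^2 = n^{-1} S\left( \hat{\boldsymbol\phi}, \hat{\boldsymbol\theta} \right), \tag{5.2.10}$$

where

$$S\left( \hat{\boldsymbol\phi}, \hat{\boldsymbol\theta} \right) = \sum_{j=1}^{n} \left( X_j - \hat X_j \right)^2 / r_{j-1}, \tag{5.2.11}$$

and $\hat{\boldsymbol\phi}$, $\hat{\boldsymbol\theta}$ are the values of $\boldsymbol\phi$, $\boldsymbol\theta$ that minimize

$$\ell(\boldsymbol\phi, \boldsymbol\theta) = \ln\left( n^{-1} S(\boldsymbol\phi, \boldsymbol\theta) \right) + n^{-1} \sum_{j=1}^{n} \ln r_{j-1}. \tag{5.2.12}$$

Minimization of $\ell(\boldsymbol\phi, \boldsymbol\theta)$ must be done numerically. Initial values for $\boldsymbol\phi$ and $\boldsymbol\theta$ can be obtained from ITSM using the methods described in Section 5.1. The program then searches systematically for the values of $\boldsymbol\phi$ and $\boldsymbol\theta$ that minimize the reduced likelihood (5.2.12) and computes the corresponding maximum likelihood estimate of $\sigma^2$ from (5.2.10).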
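
As an illustration of minimizing the reduced likelihood numerically, the Python sketch below treats the simplest case, a causal AR(1) model, for which $r_0 = 1/(1 - \phi^2)$, $r_j = 1$ for $j \ge 1$, $\hat X_1 = 0$, and $\hat X_j = \phi X_{j-1}$. The crude grid search is a stand-in for a proper optimizer and is not the search procedure used by ITSM; the function names and the grid step are assumptions made for this example.

```python
import math


def reduced_likelihood_ar1(phi, x):
    """Return (l(phi), S(phi)) from (5.2.12) and (5.2.11) for a causal AR(1), |phi| < 1."""
    n = len(x)
    r0 = 1.0 / (1.0 - phi * phi)            # r_0; r_j = 1 for j >= 1
    s = x[0] ** 2 / r0                      # the predictor of X_1 is 0
    s += sum((x[t] - phi * x[t - 1]) ** 2 for t in range(1, n))
    return math.log(s / n) + math.log(r0) / n, s


def fit_ar1(x, step=0.001):
    """Grid search over (-1, 1); a simple stand-in for a numerical optimizer."""
    n = len(x)
    grid = [-0.999 + step * i for i in range(int(round(1.998 / step)) + 1)]
    phi_hat = min(grid, key=lambda phi: reduced_likelihood_ar1(phi, x)[0])
    sigma2_hat = reduced_likelihood_ar1(phi_hat, x)[1] / n   # (5.2.10)
    return phi_hat, sigma2_hat
```

In practice one would start a gradient-based or simplex search from a preliminary estimate, as the text describes, rather than scan a grid; the grid merely keeps the example free of external dependencies.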


Least  Squares  Estimation  for  Mixed  Models 

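The least squares estimates $\tilde{\boldsymbol\phi}$ and $\tilde{\boldsymbol\theta}$ of $\boldsymbol\phi$ and $\boldsymbol\theta$ are obtained by minimizing the function $S$ as defined in (5.2.11) rather than $\ell$ as defined in (5.2.12), subject to the constraints that the model be causal and invertible. The least squares estimate of $\sigma^2$ is

$$\tilde\sigma^2 = \frac{S\left( \tilde{\boldsymbol\phi}, \tilde{\boldsymbol\theta} \right)}{n - p - q}.$$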


Order  Selection 

In Section 5.1 we introduced minimization of the AICC value as a major criterion for the selection of the orders p and q. This criterion is applied as follows:

AICC Criterion:

Choose $p$, $q$, $\boldsymbol\phi_p$, and $\boldsymbol\theta_q$ to minimize

$$\mathrm{AICC} = -2 \ln L\left( \boldsymbol\phi_p, \boldsymbol\theta_q, S(\boldsymbol\phi_p, \boldsymbol\theta_q)/n \right) + 2(p + q + 1)n/(n - p - q - 2).$$

For any fixed p and q it is clear that the AICC is minimized when $\boldsymbol\phi_p$ and $\boldsymbol\theta_q$ are the vectors that minimize $-2 \ln L(\boldsymbol\phi_p, \boldsymbol\theta_q, S(\boldsymbol\phi_p, \boldsymbol\theta_q)/n)$, i.e., the maximum likelihood estimators. Final decisions with respect to order selection should therefore be made on the basis of maximum likelihood estimators (rather than the preliminary estimators of Section 5.1, which serve primarily as a guide). The AICC statistic and its justification are discussed in detail in Section 5.5.
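
The penalty term is simple to evaluate once $-2 \ln L$ is known. The short Python sketch below (the function name `aicc` is illustrative) applies the formula to the figures quoted for the maximum likelihood AR(1) model of Example 5.2.4, namely $-2\ln L = 70.321$ with $p = 1$, $q = 0$, and $n = 77$, the sample size that appears in that example's standard-error formula.

```python
def aicc(neg2_log_lik, p, q, n):
    """AICC = -2 ln L + 2 (p + q + 1) n / (n - p - q - 2)."""
    return neg2_log_lik + 2.0 * (p + q + 1) * n / (n - p - q - 2)


print(round(aicc(70.321, 1, 0, 77), 3))  # 74.483, the value reported in Example 5.2.4
```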

One of the options in the program ITSM is Model>Estimation>Autofit. Selection of this option allows you to specify a range of values for both p and q, after which the program will automatically fit maximum likelihood ARMA(p, q) models for all p and q in the specified range, and select from these the model with smallest AICC value. This may be slow if a large range is selected (the maximum range is from 0 through 27 for both p and q), and once the model has been determined, it should be checked by preliminary estimation followed by maximum likelihood estimation to minimize the risk of the fitted model corresponding to a local rather than a global maximum of the likelihood. (For more details see Section E.3.1.)

Confidence  Regions  for  the  Coefficients 

For large sample size the maximum likelihood estimator $\hat{\boldsymbol\beta}$ of $\boldsymbol\beta := (\phi_1, \ldots, \phi_p, \theta_1, \ldots, \theta_q)'$ is approximately normally distributed with mean $\boldsymbol\beta$ and covariance matrix $[n^{-1}V(\boldsymbol\beta)]$, which can be approximated by $2H^{-1}(\boldsymbol\beta)$, where $H$ is the Hessian matrix $[\partial^2 \ell(\boldsymbol\beta)/\partial\beta_i\,\partial\beta_j]_{i,j=1}^{p+q}$. ITSM prints out the approximate standard deviations and correlations of the coefficient estimators based on the Hessian matrix evaluated numerically at $\hat{\boldsymbol\beta}$ unless this matrix is not positive definite, in which case ITSM instead computes the theoretical asymptotic covariance matrix in Section 9.8 of Brockwell and Davis (1991). The resulting covariances can be used to compute confidence bounds for the parameters.

Large-Sample Distribution of Maximum Likelihood Estimators:

For a large sample from an ARMA(p, q) process,

$$\hat{\boldsymbol\beta} \approx \mathrm{N}\left( \boldsymbol\beta,\ n^{-1}V(\boldsymbol\beta) \right).$$

The general form of $V(\boldsymbol\beta)$ can be found in Brockwell and Davis (1991), Section 9.8. The following are several special cases.

Example 5.2.1 An AR(p) Model

The asymptotic covariance matrix in this case is the same as that for the Yule-Walker estimates given by

$$V(\boldsymbol\phi) = \sigma^2 \Gamma_p^{-1}.$$

In the special cases $p = 1$ and $p = 2$, we have

$$\mathrm{AR}(1):\ V(\phi) = \left( 1 - \phi_1^2 \right),$$

$$\mathrm{AR}(2):\ V(\boldsymbol\phi) = \begin{bmatrix} 1 - \phi_2^2 & -\phi_1(1 + \phi_2) \\ -\phi_1(1 + \phi_2) & 1 - \phi_2^2 \end{bmatrix}.$$

□

Example 5.2.2 An MA(q) Model

Let $\Gamma_q^*$ be the covariance matrix of $Y_1, \ldots, Y_q$, where $\{Y_t\}$ is the autoregressive process with autoregressive polynomial $\theta(z)$, i.e.,

$$Y_t + \theta_1 Y_{t-1} + \cdots + \theta_q Y_{t-q} = Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, 1).$$

Then it can be shown that

$$V(\boldsymbol\theta) = \Gamma_q^{*-1}.$$

Inspection of the results of Example 5.2.1 and replacement of $\phi_i$ by $-\theta_i$ yields

$$\mathrm{MA}(1):\ V(\theta) = \left( 1 - \theta_1^2 \right),$$

$$\mathrm{MA}(2):\ V(\boldsymbol\theta) = \begin{bmatrix} 1 - \theta_2^2 & \theta_1(1 - \theta_2) \\ \theta_1(1 - \theta_2) & 1 - \theta_2^2 \end{bmatrix}.$$

□

Example 5.2.3 An ARMA(1, 1) Model

For a causal and invertible ARMA(1, 1) process with coefficients $\phi$ and $\theta$,

$$V(\phi, \theta) = \frac{1 + \phi\theta}{(\phi + \theta)^2} \begin{bmatrix} (1 - \phi^2)(1 + \phi\theta) & -(1 - \theta^2)(1 - \phi^2) \\ -(1 - \theta^2)(1 - \phi^2) & (1 - \theta^2)(1 + \phi\theta) \end{bmatrix}.$$

□

Example 5.2.4 The Dow Jones Utilities Index

For the Burg and Yule-Walker AR(1) models derived for the differenced and mean-corrected series in Examples 5.1.1 and 5.1.3, the Model>Estimation>Preliminary option of ITSM gives $-2\ln(L) = 70.330$ for the Burg model and $-2\ln(L) = 70.378$ for the Yule-Walker model. Since maximum likelihood estimation attempts to minimize $-2\ln L$, the Burg estimate appears to be a slightly better initial estimate of $\phi$. We therefore retain the Burg AR(1) model and then select Model>Estimation>Max Likelihood and click OK. The Burg coefficient estimates provide initial parameter values to start the search for the minimizing values. The model found on completion of the minimization is

$$Y_t - 0.4471Y_{t-1} = Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0,\ 0.02117). \tag{5.2.13}$$

This model is different again from the Burg and Yule-Walker models. It has $-2\ln(L) = 70.321$, corresponding to a slightly higher likelihood. The standard error (or estimated standard deviation) of the estimator $\hat\phi$ is found from the program to be 0.1050. This is close to the estimated standard deviation $\sqrt{(1 - (0.4471)^2)/77} = 0.1019$, based on the large-sample approximation given in Example 5.2.1. Using the value computed from ITSM, approximate 95% confidence bounds for $\phi$ are $0.4471 \pm 1.96 \times 0.1050 = (0.2413,\ 0.6529)$. These are quite close to the bounds based on the Yule-Walker and Burg estimates found in Examples 5.1.1 and 5.1.3. To find the minimum-AICC model for the series $\{Y_t\}$ using ITSM, choose the option Model>Estimation>Autofit. Using the default range for both p and q, and clicking on Start, we quickly find that the minimum AICC ARMA(p, q) model with $p \le 5$ and $q \le 5$ is the AR(1) model defined by (5.2.13). The corresponding AICC value is 74.483. If we increase the upper limits for p and q, we obtain the same result.

□ 
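
The two standard-error figures in Example 5.2.4 and the resulting confidence bounds can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; the function names are illustrative, and the inputs (0.4471, 0.1050, $n = 77$) are the values quoted in the example.

```python
import math


def ar1_large_sample_se(phi_hat, n):
    """Large-sample standard error sqrt((1 - phi^2) / n) from Example 5.2.1."""
    return math.sqrt((1.0 - phi_hat ** 2) / n)


def normal_bounds(estimate, se, z=1.96):
    """Approximate confidence bounds: estimate +/- z * se."""
    return estimate - z * se, estimate + z * se


se_asymptotic = ar1_large_sample_se(0.4471, 77)
lower, upper = normal_bounds(0.4471, 0.1050)   # 0.1050 is the ITSM standard error
print(round(se_asymptotic, 4), round(lower, 4), round(upper, 4))
```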


Example  5.2.5  The  Lake  Data 

Using the option Model>Estimation>Autofit as in the previous example, we find that the minimum-AICC ARMA(p, q) model for the mean-corrected lake data, $X_t = Y_t - 9.0041$, of Examples 5.1.6 and 5.1.7 is the ARMA(1,1) model

$$X_t - 0.7446X_{t-1} = Z_t + 0.3213Z_{t-1}, \quad \{Z_t\} \sim \mathrm{WN}(0,\ 0.4750). \tag{5.2.14}$$

The estimated standard deviations of the two coefficient estimates $\hat\phi$ and $\hat\theta$ are found from ITSM to be 0.0773 and 0.1123, respectively. (The respective estimated standard deviations based on the large-sample approximation given in Example 5.2.3 are 0.0788 and 0.1119.) The corresponding 95% confidence bounds are therefore

$$\phi:\ 0.7446 \pm 1.96 \times 0.0773 = (0.5931,\ 0.8961),$$

$$\theta:\ 0.3208 \pm 1.96 \times 0.1123 = (0.1007,\ 0.5409).$$

The value of AICC for this model is 212.77, improving on the values for the preliminary models of Examples 5.1.4, 5.1.6, and 5.1.7.

□ 
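
The large-sample standard deviations mentioned in Example 5.2.5 come from the ARMA(1,1) covariance matrix of Example 5.2.3. A hedged Python sketch of that calculation follows. The function names are illustrative, and the sample size $n = 98$ is an assumption inferred from the residual test output in Section 5.3.3 (a difference-sign mean of 48.5 corresponds to $n = 98$). With these inputs the results land close to, but not exactly on, the quoted 0.0788 and 0.1119, so small differences in rounding or in the sample size used by the authors should be expected.

```python
import math


def arma11_asymptotic_cov(phi, theta):
    """V(phi, theta) of Example 5.2.3 for a causal, invertible ARMA(1,1)."""
    c = (1.0 + phi * theta) / (phi + theta) ** 2
    v11 = c * (1.0 - phi ** 2) * (1.0 + phi * theta)
    v12 = -c * (1.0 - theta ** 2) * (1.0 - phi ** 2)
    v22 = c * (1.0 - theta ** 2) * (1.0 + phi * theta)
    return [[v11, v12], [v12, v22]]


def arma11_standard_errors(phi, theta, n):
    """Square roots of the diagonal of V / n."""
    v = arma11_asymptotic_cov(phi, theta)
    return math.sqrt(v[0][0] / n), math.sqrt(v[1][1] / n)


# Coefficients of model (5.2.14); n = 98 is an inferred assumption.
se_phi, se_theta = arma11_standard_errors(0.7446, 0.3213, 98)
print(round(se_phi, 4), round(se_theta, 4))
```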


5.3  Diagnostic  Checking 


Typically,  the  goodness  of  fit  of  a  statistical  model  to  a  set  of  data  is  judged  by 
comparing  the  observed  values  with  the  corresponding  predicted  values  obtained  from 
the  fitted  model.  If  the  fitted  model  is  appropriate,  then  the  residuals  should  behave  in 
a  manner  that  is  consistent  with  the  model. 

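When we fit an ARMA(p, q) model to a given series we determine the maximum likelihood estimators $\hat{\boldsymbol\phi}$, $\hat{\boldsymbol\theta}$, and $\hat\sigma^2$ of the parameters $\boldsymbol\phi$, $\boldsymbol\theta$, and $\sigma^2$. In the course of this procedure the predicted values $\hat X_t(\hat{\boldsymbol\phi}, \hat{\boldsymbol\theta})$ of $X_t$ based on $X_1, \ldots, X_{t-1}$ are computed for the fitted model. The residuals are then defined, in the notation of Section 3.3, by

$$\hat W_t = \left( X_t - \hat X_t\left( \hat{\boldsymbol\phi}, \hat{\boldsymbol\theta} \right) \right) \Big/ \left( r_{t-1}\left( \hat{\boldsymbol\phi}, \hat{\boldsymbol\theta} \right) \right)^{1/2}, \quad t = 1, \ldots, n. \tag{5.3.1}$$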


If we were to assume that the maximum likelihood ARMA(p, q) model is the true process generating $\{X_t\}$, then we could say that $\{\hat W_t\} \sim \mathrm{WN}(0, \hat\sigma^2)$. However, to check the appropriateness of an ARMA(p, q) model for the data we should assume only that $X_1, \ldots, X_n$ are generated by an ARMA(p, q) process with unknown parameters $\boldsymbol\phi$, $\boldsymbol\theta$, and $\sigma^2$, whose maximum likelihood estimators are $\hat{\boldsymbol\phi}$, $\hat{\boldsymbol\theta}$, and $\hat\sigma^2$, respectively. Then it is not true that $\{\hat W_t\}$ is white noise. Nonetheless $\hat W_t$, $t = 1, \ldots, n$, should have properties that are similar to those of the white noise sequence

$$W_t(\boldsymbol\phi, \boldsymbol\theta) = \left( X_t - \hat X_t(\boldsymbol\phi, \boldsymbol\theta) \right) \Big/ \left( r_{t-1}(\boldsymbol\phi, \boldsymbol\theta) \right)^{1/2}, \quad t = 1, \ldots, n.$$

Moreover, $W_t(\boldsymbol\phi, \boldsymbol\theta)$ approximates the white noise term in the defining equation (5.1.1) in the sense that $E(W_t(\boldsymbol\phi, \boldsymbol\theta) - Z_t)^2 \to 0$ as $t \to \infty$ (Brockwell and Davis (1991), Section 8.11). Consequently, the properties of the residuals $\{\hat W_t\}$ should reflect those of the white noise sequence $\{Z_t\}$ generating the underlying ARMA(p, q) process. In particular, the sequence $\{\hat W_t\}$ should be approximately (1) uncorrelated if $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$, (2) independent if $\{Z_t\} \sim \mathrm{IID}(0, \sigma^2)$, and (3) normally distributed if $Z_t \sim \mathrm{N}(0, \sigma^2)$.

The rescaled residuals $\hat R_t$, $t = 1, \ldots, n$, are obtained by dividing the residuals $\hat W_t$, $t = 1, \ldots, n$, by the estimate $\hat\sigma = \sqrt{\left( \sum_{t=1}^{n} \hat W_t^2 \right)/n}$ of the white noise standard deviation. Thus,

$$\hat R_t = \hat W_t / \hat\sigma. \tag{5.3.2}$$

If the fitted model is appropriate, the rescaled residuals should have properties similar to those of a WN(0, 1) sequence or of an iid(0, 1) sequence if we make the stronger assumption that the white noise $\{Z_t\}$ driving the ARMA process is independent white noise.

The following diagnostic checks are all based on the expected properties of the residuals or rescaled residuals under the assumption that the fitted model is correct and that $\{Z_t\} \sim \mathrm{IID}(0, \sigma^2)$. They are the same tests introduced in Section 1.6.


5.3.1 The Graph of $\{\hat R_t,\ t = 1, \ldots, n\}$

If the fitted model is appropriate, then the graph of the rescaled residuals $\{\hat R_t,\ t = 1, \ldots, n\}$ should resemble that of a white noise sequence with variance one. While it is difficult to identify the correlation structure of $\{\hat R_t\}$ (or any time series for that matter) from its graph, deviations of the mean from zero are sometimes clearly indicated by a trend or cyclic component and nonconstancy of the variance by fluctuations in $\hat R_t$, whose magnitude depends strongly on $t$.

The rescaled residuals obtained from the ARMA(1,1) model fitted to the mean-corrected lake data in Example 5.2.5 are displayed in Figure 5-5. The graph gives no indication of a nonzero mean or nonconstant variance, so on this basis there is no reason to doubt the compatibility of $\hat R_1, \ldots, \hat R_n$ with unit-variance white noise.


Figure 5-5 The rescaled residuals after fitting the ARMA(1,1) model of Example 5.2.5 to the lake data

Figure 5-6 The sample ACF of the residuals after fitting the ARMA(1,1) model of Example 5.2.5 to the lake data

The next step is to check that the sample autocorrelation function of $\{\hat W_t\}$ (or equivalently of $\{\hat R_t\}$) behaves as it should under the assumption that the fitted model is appropriate.


5.3.2  The  Sample  ACF  of  the  Residuals 

We know from Section 1.6 that for large $n$ the sample autocorrelations of an iid sequence $Y_1, \ldots, Y_n$ with finite variance are approximately iid with distribution $\mathrm{N}(0, 1/n)$. We can therefore test whether or not the observed residuals are consistent with iid noise by examining the sample autocorrelations of the residuals and rejecting the iid noise hypothesis if more than two or three out of 40 fall outside the bounds $\pm 1.96/\sqrt{n}$ or if one falls far outside the bounds. (As indicated above, our estimated residuals will not be precisely iid even if the true model generating the data is as assumed. To correct for this the bounds $\pm 1.96/\sqrt{n}$ should be modified to give a more precise test as in Box and Pierce (1970) and Brockwell and Davis (1991), Section 9.4.) The sample ACF and PACF of the residuals and the bounds $\pm 1.96/\sqrt{n}$ can be viewed by pressing the second green button (Plot ACF/PACF of residuals) at the top of the ITSM window. Figure 5-6 shows the sample ACF of the residuals after fitting the ARMA(1,1) model of Example 5.2.5 to the lake data. As can be seen from the graph, there is no cause to reject the fitted model on the basis of these autocorrelations.
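
The check described above can be sketched in Python as follows. The code computes the sample autocorrelations with the usual divisor $n$ and lists the lags that fall outside $\pm 1.96/\sqrt{n}$; the function names are illustrative, and the unmodified bounds are used, so this is the rough version of the test rather than the refined one cited in the text.

```python
import math


def sample_acf(x, max_lag):
    """Sample autocorrelations rho(0), ..., rho(max_lag), using divisor n."""
    n = len(x)
    mean = sum(x) / n
    d = [xi - mean for xi in x]
    gamma0 = sum(di * di for di in d) / n
    return [sum(d[t] * d[t + h] for t in range(n - h)) / n / gamma0
            for h in range(max_lag + 1)]


def lags_outside_bounds(residuals, max_lag=40):
    """Lags 1..max_lag whose sample autocorrelation exceeds 1.96 / sqrt(n) in modulus."""
    bound = 1.96 / math.sqrt(len(residuals))
    rho = sample_acf(residuals, max_lag)
    return [h for h in range(1, max_lag + 1) if abs(rho[h]) > bound]
```

Under the iid hypothesis roughly 5% of the lags (about two out of 40) are expected to fall outside the bounds by chance, which is why the rule of thumb in the text tolerates two or three exceedances.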


5.3.3  Tests  for  Randomness  of  the  Residuals 

The tests (b), (c), (d), (e), and (f) of Section 1.6 can be carried out using the program ITSM by selecting Statistics>Residual Analysis>Tests of Randomness.

Applying these tests to the residuals from the ARMA(1,1) model for the mean-corrected lake data (Example 5.2.5), and using the default value $h = 22$ suggested for the portmanteau tests, we obtain the following results:

RANDOMNESS TEST STATISTICS
LJUNG-BOX PORTM. = 10.23    CHISQUR(20)                p=0.964
MCLEOD-LI PORTM. = 16.55    CHISQUR(22)                p=0.788
TURNING POINTS   = 69       ANORMAL(64.0, 4.14**2)     p=0.227
DIFFERENCE-SIGN  = 50       ANORMAL(48.5, 2.87**2)     p=0.602
RANK TEST        = 2083     ANORMAL(2376, 488.7**2)    p=0.072
JARQUE-BERA      = 0.285    CHISQUR(2)                 p=0.867
ORDER OF MIN AICC YW MODEL FOR RESIDUALS = 0

This  table  shows  the  observed  values  of  the  statistics  defined  in  Section  1.6,  with  each 
followed  by  its  large-sample  distribution  under  the  null  hypothesis  of  iid  residuals, 
and  the  corresponding  p-values.  The  observed  values  can  thus  be  checked  easily  for 
compatibility  with  their  distributions  under  the  null  hypothesis.  Since  all  of  the  p- 
values  are  greater  than  0.05,  none  of  the  test  statistics  leads  us  to  reject  the  null 
hypothesis  at  this  level.  The  order  of  the  minimum  AICC  autoregressive  model  for 
the  residuals  also  suggests  the  compatibility  of  the  residuals  with  white  noise. 
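
Two of the p-values in the table are easy to retrace by hand. The Python sketch below uses the standard large-sample means and variances of the turning point and difference-sign statistics under the iid hypothesis, namely $2(n-2)/3$ with variance $(16n-29)/90$ and $(n-1)/2$ with variance $(n+1)/12$. The function names are illustrative, and $n = 98$ is inferred from the means 64.0 and 48.5 printed in the table rather than stated in this section.

```python
import math


def two_sided_p(z):
    """Two-sided normal p-value, 2 * (1 - Phi(|z|)), via the complementary error function."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def turning_point_test(n, count):
    """Mean, standard deviation and p-value for the number of turning points."""
    mean = 2.0 * (n - 2) / 3.0
    sd = math.sqrt((16.0 * n - 29.0) / 90.0)
    return mean, sd, two_sided_p((count - mean) / sd)


def difference_sign_test(n, count):
    """Mean, standard deviation and p-value for the number of positive differences."""
    mean = (n - 1) / 2.0
    sd = math.sqrt((n + 1) / 12.0)
    return mean, sd, two_sided_p((count - mean) / sd)


print(turning_point_test(98, 69))
print(difference_sign_test(98, 50))
```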

A rough check for normality is provided by visual inspection of the histogram of the rescaled residuals, obtained by selecting the third green button at the top of the ITSM window. A Gaussian qq-plot of the residuals can also be plotted by selecting Statistics>Residual Analysis>QQ-Plot (normal). No obvious deviation from normality is apparent in either the histogram or the qq-plot. The Jarque-Bera statistic, $n[m_3^2/(6m_2^3) + (m_4/m_2^2 - 3)^2/24]$, where $m_r = \sum_{j=1}^{n}(Y_j - \bar Y)^r/n$, is distributed asymptotically as $\chi^2(2)$ if $\{Y_t\} \sim \mathrm{IID\ N}(\mu, \sigma^2)$. This hypothesis is rejected if the statistic is sufficiently large (at level $\alpha$ if the p-value of the test is less than $\alpha$). In this case the large p-value computed by ITSM provides no evidence for rejecting the normality hypothesis.
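
A minimal Python sketch of the Jarque-Bera calculation is given below; the function name is illustrative. It relies on the fact that the upper tail probability of a chi-squared distribution with 2 degrees of freedom is exactly $e^{-x/2}$, so no statistical library is needed. With the statistic 0.285 from the table this gives $e^{-0.1425} \approx 0.867$, in agreement with the printed p-value.

```python
import math


def jarque_bera(y):
    """Jarque-Bera statistic and its asymptotic chi-squared(2) p-value."""
    n = len(y)
    mean = sum(y) / n
    m2, m3, m4 = (sum((v - mean) ** r for v in y) / n for r in (2, 3, 4))
    stat = n * (m3 ** 2 / (6.0 * m2 ** 3) + (m4 / m2 ** 2 - 3.0) ** 2 / 24.0)
    p_value = math.exp(-stat / 2.0)   # exact chi-squared(2) upper tail
    return stat, p_value
```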


5.4  Forecasting 


Once  a  model  has  been  fitted  to  the  data,  forecasting  future  values  of  the  time  series 
can  be  carried  out  using  the  method  described  in  Section  3.3.  We  illustrate  this  method 
with  one  of  the  examples  from  Section  3.2. 


Example 5.4.1 For the overshort data $\{X_t\}$ of Example 3.2.8, selection of the options Model>Estimation>Preliminary, the innovations algorithm, and then Model>Estimation>Max likelihood, gives the maximum likelihood MA(1) model for $\{X_t\}$,

$$X_t + 4.035 = Z_t - 0.818Z_{t-1}, \quad \{Z_t\} \sim \mathrm{WN}(0,\ 2040.75). \tag{5.4.1}$$

To predict the next 7 days of overshorts, we treat (5.4.1) as the true model for the data, and use the results of Example 3.3.3 with $\phi = 0$. From (3.3.11), the predictors are given by

$$P_{57}X_{57+h} = \begin{cases} -4.035 + \theta_{57,1}\left( X_{57} - \hat X_{57} \right), & \text{if } h = 1, \\ -4.035, & \text{if } h > 1, \end{cases}$$

with mean squared error

$$E\left( X_{57+h} - P_{57}X_{57+h} \right)^2 = \begin{cases} 2040.75\, r_{57}, & \text{if } h = 1, \\ 2040.75\left( 1 + (-0.818)^2 \right), & \text{if } h > 1, \end{cases}$$

where $\theta_{57,1}$ and $r_{57}$ are computed recursively from (3.3.9) with $\theta = -0.818$.

Table 5.1 Forecasts of the next seven observations of the overshort data of Example 3.2.8 using model (5.4.1)

 #     XHAT    SQRT(MSE)    XHAT + MEAN
58    1.0097    45.1753      -3.0254
59    0.0000    58.3602      -4.0351
60    0.0000    58.3602      -4.0351
61    0.0000    58.3602      -4.0351
62    0.0000    58.3602      -4.0351
63    0.0000    58.3602      -4.0351
64    0.0000    58.3602      -4.0351

These calculations are performed with ITSM by fitting the maximum likelihood model (5.4.1), selecting Forecasting>ARMA, and specifying the number of forecasts required. The 1-step, 2-step, ..., and 7-step forecasts of $X_t$ are shown in Table 5.1. Notice that the predictor of $X_t$ for $t \ge 59$ is equal to the sample mean, since under the MA(1) model $\{X_t,\ t \ge 59\}$ is uncorrelated with $\{X_t,\ t \le 57\}$.

Assuming that the innovations $\{Z_t\}$ are normally distributed, an approximate 95% prediction interval for $X_{64}$ is given by

$$-4.0351 \pm 1.96 \times 58.3602 = (-118.42,\ 110.35).$$


□ 
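
The root mean squared errors in Table 5.1 and the prediction interval can be retraced with the short Python sketch below. It assumes the MA(1) form of the innovations recursion, $r_0 = 1 + \theta^2$ and $r_n = 1 + \theta^2 - \theta^2/r_{n-1}$; the function names are illustrative. Because the coefficient and variance are taken as rounded in (5.4.1), the results agree with the table only to about two decimal places.

```python
import math


def ma1_r(theta, n):
    """r_n for an MA(1) process: r_0 = 1 + theta^2, r_k = 1 + theta^2 - theta^2 / r_{k-1}."""
    r = 1.0 + theta ** 2
    for _ in range(n):
        r = 1.0 + theta ** 2 - theta ** 2 / r
    return r


def ma1_forecast_rmse(theta, sigma2, n, h):
    """Root mean squared error of the h-step predictor based on n observations."""
    if h == 1:
        return math.sqrt(sigma2 * ma1_r(theta, n))
    return math.sqrt(sigma2 * (1.0 + theta ** 2))


def prediction_interval(forecast, rmse, z=1.96):
    """Approximate Gaussian prediction bounds: forecast +/- z * rmse."""
    return forecast - z * rmse, forecast + z * rmse


rmse_1 = ma1_forecast_rmse(-0.818, 2040.75, 57, 1)
rmse_7 = ma1_forecast_rmse(-0.818, 2040.75, 57, 7)
lower, upper = prediction_interval(-4.0351, 58.3602)   # values taken from Table 5.1
print(round(rmse_1, 2), round(rmse_7, 2), round(lower, 2), round(upper, 2))
```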


The mean squared errors of prediction, as computed in Section 3.3 and the example above, are based on the assumption that the fitted model is in fact the true model for the data. As a result, they do not reflect the variability in the estimation of the model parameters. To illustrate this point, suppose the data $X_1, \ldots, X_n$ are generated from the causal AR(1) model

$$X_t = \phi X_{t-1} + Z_t, \quad \{Z_t\} \sim \mathrm{iid}\left( 0, \sigma^2 \right).$$

If $\hat\phi$ is the maximum likelihood estimate of $\phi$, based on $X_1, \ldots, X_n$, then the one-step ahead forecast of $X_{n+1}$ is $\hat\phi X_n$, which has mean squared error

$$E\left( X_{n+1} - \hat\phi X_n \right)^2 = E\left( \left( \phi - \hat\phi \right) X_n + Z_{n+1} \right)^2 = E\left( \left( \phi - \hat\phi \right) X_n \right)^2 + \sigma^2. \tag{5.4.2}$$

The second equality follows from the independence of $Z_{n+1}$ and $(\hat\phi, X_n)'$. To evaluate the first term in (5.4.2), first condition on $X_n$ and then use the approximations

$$E\left( \left( \phi - \hat\phi \right)^2 \mid X_n \right) \approx E\left( \phi - \hat\phi \right)^2 \approx \left( 1 - \phi^2 \right)/n,$$

where the second relation comes from the formula for the asymptotic variance of $\hat\phi$ given by $\sigma^2 \Gamma_1^{-1} = (1 - \phi^2)$ (see Example 5.2.1). The one-step mean squared error is then approximated by

$$E\left( \phi - \hat\phi \right)^2 E X_n^2 + \sigma^2 \approx n^{-1}\left( 1 - \phi^2 \right)\left( 1 - \phi^2 \right)^{-1}\sigma^2 + \sigma^2 = \frac{n + 1}{n}\,\sigma^2.$$

Thus, the error in parameter estimation contributes the term $\sigma^2/n$ to the mean squared error of prediction. If the sample size is large, this factor is negligible, and so for the purpose of mean squared error computation, the estimated parameters can be treated as the true model parameters. On the other hand, for small sample sizes, ignoring parameter variability can lead to a severe underestimate of the actual mean squared error of the forecast.


5.5  Order  Selection 

Once the data have been transformed (e.g., by some combination of Box-Cox and differencing transformations or by removal of trend and seasonal components) to the point where the transformed series $\{X_t\}$ can potentially be fitted by a zero-mean ARMA model, we are faced with the problem of selecting appropriate values for the orders p and q.

It is not advantageous from a forecasting point of view to choose p and q arbitrarily large. Fitting a very high order model will generally result in a small estimated white noise variance, but when the fitted model is used for forecasting, the mean squared error of the forecasts will depend not only on the white noise variance of the fitted model but also on errors arising from estimation of the parameters of the model (see the paragraphs following Example 5.4.1). These will be larger for higher-order models. For this reason we need to introduce a “penalty factor” to discourage the fitting of models with too many parameters.

Many  criteria  based  on  such  penalty  factors  have  been  proposed  in  the  literature, 
since  the  problem  of  model  selection  arises  frequently  in  statistics,  particularly  in 
regression  analysis.  We  shall  restrict  attention  here  to  a  brief  discussion  of  the  FPE, 
AIC,  and  BIC  criteria  of  Akaike  and  a  bias-corrected  version  of  the  AIC  known  as  the 
AICC. 


5.5.1  The  FPE  Criterion 

The FPE criterion was developed by Akaike (1969) to select the appropriate order of an AR process to fit to a time series $\{X_1, \ldots, X_n\}$. Instead of trying to choose the order $p$ to make the estimated white noise variance as small as possible, the idea is to choose the model for $\{X_t\}$ in such a way as to minimize the one-step mean squared error when the model fitted to $\{X_t\}$ is used to predict an independent realization $\{Y_t\}$ of the same process that generated $\{X_t\}$.

Suppose then that $\{X_1, \ldots, X_n\}$ is a realization of an AR(p) process with coefficients $\phi_1, \ldots, \phi_p$, $p < n$, and that $\{Y_1, \ldots, Y_n\}$ is an independent realization of the same process. If $\hat\phi_1, \ldots, \hat\phi_p$ are the maximum likelihood estimators of the coefficients based on $\{X_1, \ldots, X_n\}$ and if we use these to compute the one-step predictor $\hat\phi_1 Y_n + \cdots + \hat\phi_p Y_{n+1-p}$ of $Y_{n+1}$, then the mean square prediction error is

$$\begin{aligned} E\left( Y_{n+1} - \hat\phi_1 Y_n - \cdots - \hat\phi_p Y_{n+1-p} \right)^2 &= E\left[ Y_{n+1} - \phi_1 Y_n - \cdots - \phi_p Y_{n+1-p} - \left( \hat\phi_1 - \phi_1 \right) Y_n - \cdots - \left( \hat\phi_p - \phi_p \right) Y_{n+1-p} \right]^2 \\ &= \sigma^2 + E\left[ \left( \hat{\boldsymbol\phi}_p - \boldsymbol\phi_p \right)' \left[ Y_{n+1-i} Y_{n+1-j} \right]_{i,j=1}^{p} \left( \hat{\boldsymbol\phi}_p - \boldsymbol\phi_p \right) \right], \end{aligned}$$

where $\boldsymbol\phi_p' = (\phi_1, \ldots, \phi_p)'$, $\hat{\boldsymbol\phi}_p' = (\hat\phi_1, \ldots, \hat\phi_p)'$, and $\sigma^2$ is the white noise variance of the AR(p) model. Writing the last term in the preceding equation as the expectation of the conditional expectation given $X_1, \ldots, X_n$, and using the independence of $\{X_1, \ldots, X_n\}$ and $\{Y_1, \ldots, Y_n\}$, we obtain

$$E\left( Y_{n+1} - \hat\phi_1 Y_n - \cdots - \hat\phi_p Y_{n+1-p} \right)^2 = \sigma^2 + E\left[ \left( \hat{\boldsymbol\phi}_p - \boldsymbol\phi_p \right)' \Gamma_p \left( \hat{\boldsymbol\phi}_p - \boldsymbol\phi_p \right) \right],$$

where $\Gamma_p = E[Y_i Y_j]_{i,j=1}^{p}$. We can approximate the last term by assuming that the random variable $n^{1/2}(\hat{\boldsymbol\phi}_p - \boldsymbol\phi_p)$ has its large-sample distribution $\mathrm{N}(\mathbf{0}, \sigma^2 \Gamma_p^{-1})$ as given in Example 5.2.1. Using Problem 5.13, we then find that

$$E\left( Y_{n+1} - \hat\phi_1 Y_n - \cdots - \hat\phi_p Y_{n+1-p} \right)^2 \approx \sigma^2 \left( 1 + \frac{p}{n} \right). \tag{5.5.1}$$

If $\hat\sigma^2$ is the maximum likelihood estimator of $\sigma^2$, then for large $n$, $n\hat\sigma^2/\sigma^2$ is distributed approximately as chi-squared with $(n - p)$ degrees of freedom (see Brockwell and Davis (1991), Section 8.9). We therefore replace $\sigma^2$ in (5.5.1) by the estimator $n\hat\sigma^2/(n - p)$ to get the estimated mean square prediction error of $Y_{n+1}$,

$$\mathrm{FPE}_p = \hat\sigma^2\, \frac{n + p}{n - p}. \tag{5.5.2}$$

To apply the FPE criterion for autoregressive order selection we therefore choose the value of $p$ that minimizes $\mathrm{FPE}_p$ as defined in (5.5.2).

Example 5.5.1 FPE-Based Selection of an AR Model for the Lake Data

In Example 5.1.4 we fitted AR(2) models to the mean-corrected lake data, the order 2 being suggested by the sample PACF shown in Figure 5-4. To use the FPE criterion to select $p$, we have shown in Table 5.2 the values of FPE for values of $p$ from 0 to 10. These values were found using ITSM by fitting maximum likelihood AR models with the option Model>Estimation>Max likelihood. Also shown in the table are the values of the maximum likelihood estimates of $\sigma^2$ for the same values of $p$. Whereas $\hat\sigma_p^2$ decreases steadily with $p$, the values of $\mathrm{FPE}_p$ have a clear minimum at $p = 2$, confirming our earlier choice of $p = 2$ as the most appropriate for this data set.

Table 5.2 $\hat\sigma_p^2$ and $\mathrm{FPE}_p$ for AR(p) models fitted to the lake data

 p    sigma^2_p     FPE_p
 0     1.7203       1.7203
 1     0.5097       0.5202
 2     0.4790       0.4989
 3     0.4728       0.5027
 4     0.4708       0.5109
 5     0.4705       0.5211
 6     0.4705       0.5318
 7     0.4679       0.5399
 8     0.4664       0.5493
 9     0.4664       0.5607
10     0.4453       0.5465
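
The FPE column of Table 5.2 can be recomputed from the variance estimates with a few lines of Python. This is a sketch under one assumption: the sample size $n = 98$ is inferred (it reproduces the tabulated values to within rounding) rather than stated in this section. The function and variable names are illustrative.

```python
def fpe(sigma2_hat, n, p):
    """FPE_p = sigma2_hat * (n + p) / (n - p), as in (5.5.2)."""
    return sigma2_hat * (n + p) / (n - p)


# Maximum likelihood variance estimates for p = 0, ..., 10 from Table 5.2.
sigma2_by_order = [1.7203, 0.5097, 0.4790, 0.4728, 0.4708, 0.4705,
                   0.4705, 0.4679, 0.4664, 0.4664, 0.4453]
n = 98  # assumed sample size of the lake data
fpe_by_order = [fpe(s2, n, p) for p, s2 in enumerate(sigma2_by_order)]
best_p = min(range(len(fpe_by_order)), key=fpe_by_order.__getitem__)
print(best_p, [round(v, 4) for v in fpe_by_order])
```

The recomputed values differ from the table in the fourth decimal place at a few orders, which is consistent with the table having been produced from unrounded variance estimates.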

□

5.5.2 The AICC Criterion

A more generally applicable criterion for model selection than the FPE is the information criterion of Akaike (1973), known as the AIC. This was designed to be an approximately unbiased estimate of the Kullback-Leibler index of the fitted model relative to the true model (defined below). Here we use a bias-corrected version of the AIC, referred to as the AICC, suggested by Hurvich and Tsai (1989).

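If $X$ is an $n$-dimensional random vector whose probability density belongs to the family $\{f(\cdot;\psi),\ \psi \in \Psi\}$, the Kullback-Leibler discrepancy between $f(\cdot;\psi)$ and $f(\cdot;\theta)$ is defined as

$$d(\psi|\theta) = \Delta(\psi|\theta) - \Delta(\theta|\theta),$$

where

$$\Delta(\psi|\theta) = E_\theta\big(-2\ln f(X;\psi)\big) = -2\int_{\mathbb{R}^n} \ln\big(f(x;\psi)\big)\, f(x;\theta)\,dx$$

is the Kullback-Leibler index of $f(\cdot;\psi)$ relative to $f(\cdot;\theta)$. (Note that in general, $\Delta(\psi|\theta) \ne \Delta(\theta|\psi)$.) By Jensen's inequality (see, e.g., Mood et al., 1974),

$$\begin{aligned}
d(\psi|\theta) &= -2\int_{\mathbb{R}^n} \ln\!\left(\frac{f(x;\psi)}{f(x;\theta)}\right) f(x;\theta)\,dx \\
&\ge -2\ln\!\left(\int_{\mathbb{R}^n} \frac{f(x;\psi)}{f(x;\theta)}\, f(x;\theta)\,dx\right) \\
&= -2\ln\!\left(\int_{\mathbb{R}^n} f(x;\psi)\,dx\right) \\
&= 0,
\end{aligned}$$

with equality holding if and only if $f(x;\psi) = f(x;\theta)$.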

Given observations $X_1, \ldots, X_n$ of an ARMA process with unknown parameters $\theta = (\beta, \sigma^2)$, the true model could be identified if it were possible to compute the Kullback-Leibler discrepancy between all candidate models and the true model. Since this is not possible, we estimate the Kullback-Leibler discrepancies and choose the model whose estimated discrepancy (or index) is minimum. In order to do this, we assume that the true model and the alternatives are all Gaussian. Then for any given $\theta = (\beta, \sigma^2)$, $f(\cdot;\theta)$ is the probability density of $(Y_1, \ldots, Y_n)'$, where $\{Y_t\}$ is a Gaussian ARMA($p, q$) process with coefficient vector $\beta$ and white noise variance $\sigma^2$. (The dependence of $\theta$ on $p$ and $q$ is through the dimension of the autoregressive and moving-average coefficients in $\beta$.)

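Suppose, therefore, that our observations $X_1, \ldots, X_n$ are from a Gaussian ARMA process with parameter vector $\theta = (\beta, \sigma^2)$ and assume for the moment that the true order is $(p, q)$. Let $\hat\theta = (\hat\beta, \hat\sigma^2)$ be the maximum likelihood estimator of $\theta$ based on $X_1, \ldots, X_n$ and let $Y_1, \ldots, Y_n$ be an independent realization of the true process (with parameter $\theta$). Then

$$-2\ln L_Y\big(\hat\beta, \hat\sigma^2\big) = -2\ln L_X\big(\hat\beta, \hat\sigma^2\big) + \hat\sigma^{-2} S_Y\big(\hat\beta\big) - n,$$

where $L_X$, $L_Y$, $S_X$, and $S_Y$ are defined as in (5.2.9) and (5.2.11). Hence,

$$\begin{aligned}
E_\theta\big(\Delta(\hat\theta|\theta)\big) &= E_{\beta,\sigma^2}\Big(-2\ln L_Y\big(\hat\beta, \hat\sigma^2\big)\Big) \\
&= E_{\beta,\sigma^2}\Big(-2\ln L_X\big(\hat\beta, \hat\sigma^2\big)\Big) + E_{\beta,\sigma^2}\!\left(\frac{S_Y(\hat\beta)}{\hat\sigma^2}\right) - n. \qquad (5.5.3)
\end{aligned}$$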


It can be shown using large-sample approximations (see Brockwell and Davis (1991), Section 10.3 for details) that

$$E_{\beta,\sigma^2}\!\left(\frac{S_Y(\hat\beta)}{\hat\sigma^2}\right) - n \approx \frac{2(p + q + 1)n}{n - p - q - 2},$$


from which we see that $-2\ln L_X(\hat\beta, \hat\sigma^2) + 2(p + q + 1)n/(n - p - q - 2)$ is an approximately unbiased estimator of the expected Kullback-Leibler index $E_\theta(\Delta(\hat\theta|\theta))$ in (5.5.3). Since the preceding calculations (and the maximum likelihood estimators $\hat\beta$ and $\hat\sigma^2$) are based on the assumption that the true order is $(p, q)$, we therefore select the values of $p$ and $q$ for our fitted model to be those that minimize $\mathrm{AICC}(\hat\beta)$, where

$$\mathrm{AICC}(\beta) := -2\ln L_X\big(\beta, S_X(\beta)/n\big) + \frac{2(p + q + 1)n}{n - p - q - 2}. \qquad (5.5.4)$$


The AIC statistic, defined as

$$\mathrm{AIC}(\beta) := -2\ln L_X\big(\beta, S_X(\beta)/n\big) + 2(p + q + 1),$$

can be used in the same way. Both $\mathrm{AICC}(\beta, \sigma^2)$ and $\mathrm{AIC}(\beta, \sigma^2)$ can be defined for arbitrary $\sigma^2$ by replacing $S_X(\beta)/n$ in the preceding definitions by $\sigma^2$. The value $S_X(\beta)/n$ is used in (5.5.4), since $\mathrm{AICC}(\beta, \sigma^2)$ (like $\mathrm{AIC}(\beta, \sigma^2)$) is minimized for any given $\beta$ by setting $\sigma^2 = S_X(\beta)/n$.
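The two statistics differ only in their penalty terms, which the following Python sketch makes explicit. It assumes that the value of $-2\ln L_X$ at the maximum likelihood estimates has already been computed by some fitting routine and is passed in as the argument `neg2_log_lik`; the function and argument names are illustrative and do not correspond to any ITSM option.

```python
def aic(neg2_log_lik, p, q):
    """AIC: -2 ln L plus the penalty 2(p + q + 1)."""
    return neg2_log_lik + 2 * (p + q + 1)


def aicc(neg2_log_lik, n, p, q):
    """AICC: -2 ln L plus the penalty 2(p + q + 1) n / (n - p - q - 2)."""
    denom = n - p - q - 2
    if denom <= 0:
        raise ValueError("AICC requires n > p + q + 2")
    return neg2_log_lik + 2 * (p + q + 1) * n / denom


# Illustrative comparison for an ARMA(1,1) fit with a made-up likelihood value.
example_aic = aic(100.0, 1, 1)
example_aicc = aicc(100.0, 98, 1, 1)
```

For fixed $p$ and $q$ the AICC penalty exceeds the AIC penalty, and the gap shrinks as $n$ grows.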

For fitting autoregressive models, Monte Carlo studies (Jones 1975; Shibata 1976) suggest that the AIC has a tendency to overestimate $p$. The penalty factors $2(p + q + 1)n/(n - p - q - 2)$ and $2(p + q + 1)$ for the AICC and AIC statistics are asymptotically equivalent as $n \to \infty$. The AICC statistic, however, has a more extreme penalty for large-order models, which counteracts the overfitting tendency of the AIC. The BIC is another criterion that attempts to correct the overfitting nature of the AIC. For a zero-mean causal invertible ARMA($p, q$) process, it is defined (Akaike 1978) to be

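$$\mathrm{BIC} = (n - p - q)\ln\!\left[\frac{n\hat\sigma^2}{n - p - q}\right] + n\Big(1 + \ln\sqrt{2\pi}\Big) + (p + q)\ln\!\left[\frac{\sum_{t=1}^{n} X_t^2 - n\hat\sigma^2}{p + q}\right], \qquad (5.5.5)$$

where $\hat\sigma^2$ is the maximum likelihood estimate of the white noise variance.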
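A direct transcription of the BIC into Python is sketched below, assuming that the last logarithm in (5.5.5) has numerator $\sum_{t=1}^{n} X_t^2 - n\hat\sigma^2$ and denominator $p + q$. The function name `bic` is hypothetical, and the convention of setting the last term to zero when $p + q = 0$ is an assumption made here (it is the limiting value of $(p+q)\ln[c/(p+q)]$), not something stated in the text.

```python
import math


def bic(x, sigma2_hat, p, q):
    """BIC for a zero-mean ARMA(p, q) fit with ML variance estimate sigma2_hat."""
    n = len(x)
    m = p + q
    if n - m <= 0:
        raise ValueError("need n > p + q")
    value = (n - m) * math.log(n * sigma2_hat / (n - m))
    value += n * (1.0 + math.log(math.sqrt(2.0 * math.pi)))
    if m > 0:
        excess = sum(v * v for v in x) - n * sigma2_hat
        if excess <= 0:
            raise ValueError("sum of squares must exceed n * sigma2_hat")
        value += m * math.log(excess / m)
    # Assumption: the last term is taken to be zero when p + q = 0.
    return value


# Tiny made-up data set, used only to exercise the formula.
example_bic = bic([1.0, -1.0, 1.0, -1.0], 0.5, 1, 0)
```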

The BIC is a consistent order-selection criterion in the sense that if the data $\{X_1, \ldots, X_n\}$ are in fact observations of an ARMA($p, q$) process, and if $\hat p$ and $\hat q$ are the estimated orders found by minimizing the BIC, then $\hat p \to p$ and $\hat q \to q$ with probability 1 as $n \to \infty$ (Hannan 1980). This property is not shared by the AICC or AIC. On the other hand, order selection by minimization of the AICC, AIC, or FPE is asymptotically efficient for autoregressive processes, while order selection by BIC minimization is not (Shibata 1980; Hurvich and Tsai 1989). Efficiency is a desirable property defined in terms of the one-step mean square prediction error achieved by the fitted model. For more details see Brockwell and Davis (1991), Section 10.3.

In the modeling of real data there is rarely such a thing as the “true order.” For the process $X_t = \sum_{j=0}^{\infty} \psi_j Z_{t-j}$ there may be many polynomials $\theta(z)$, $\phi(z)$ such that the coefficients of $z^j$ in $\theta(z)/\phi(z)$ closely approximate $\psi_j$ for moderately small values of $j$. Correspondingly, there may be many ARMA processes with properties similar to $\{X_t\}$. This problem of identifiability becomes much more serious for multivariate processes. The AICC criterion does, however, provide us with a rational criterion for choosing among competing models. It has been suggested (Duong 1984) that models with AIC values within $c$ of the minimum value should be considered competitive (with $c = 2$ as a typical value). Selection from among the competitive models can then be based on such factors as whiteness of the residuals (Section 5.3) and model simplicity.

We frequently need, particularly in analyzing seasonal data, to fit ARMA($p, q$) models in which all except $m$ ($\le p + q$) of the coefficients are constrained to be zero. In such cases the definition (5.5.4) is replaced by

$$\mathrm{AICC}(\beta) := -2\ln L_X\big(\beta, S_X(\beta)/n\big) + \frac{2(m + 1)n}{n - m - 2}. \qquad (5.5.6)$$

Example 5.5.2  Models for the Lake Data

In Example 5.2.4 we found that the minimum-AICC ARMA($p, q$) model for the mean-corrected lake data is the ARMA(1,1) model (5.2.14). For this model ITSM gives the values AICC = 212.77 and BIC = 216.86. A systematic check on ARMA($p, q$) models for other values of $p$ and $q$ shows that the model (5.2.14) also minimizes the BIC statistic. The minimum-AICC AR($p$) model is found to be the AR(2) model satisfying

$$X_t - 1.0441X_{t-1} + 0.2503X_{t-2} = Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, 0.4789),$$

with AICC = 213.54 and BIC = 217.63. Both the AR(2) and ARMA(1,1) models pass the diagnostic checks of Section 5.3, and in view of the small difference between the AICC values there is no strong reason to prefer one model or the other.

□


Problems

5.1  The sunspot numbers $\{X_t,\ t = 1, \ldots, 100\}$, filed as SUNSPOTS.TSM, have sample autocovariances $\hat\gamma(0) = 1382.2$, $\hat\gamma(1) = 1114.4$, $\hat\gamma(2) = 591.73$, and $\hat\gamma(3) = 96.216$. Use these values to find the Yule-Walker estimates of $\phi_1$, $\phi_2$, and $\sigma^2$ in the model

$$Y_t = \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2),$$

for the mean-corrected series $Y_t = X_t - 46.93$, $t = 1, \ldots, 100$. Assuming that the data really are a realization of an AR(2) process, find 95% confidence intervals for $\phi_1$ and $\phi_2$.

5.2  From the information given in the previous problem, use the Durbin-Levinson algorithm to compute the sample partial autocorrelations $\hat\phi_{11}$, $\hat\phi_{22}$, and $\hat\phi_{33}$ of the sunspot series. Is the value of $\hat\phi_{33}$ compatible with the hypothesis that the data are generated by an AR(2) process? (Use significance level 0.05.)

5.3  Consider the AR(2) process $\{X_t\}$ satisfying

$$X_t - \phi X_{t-1} - \phi^2 X_{t-2} = Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2).$$

a.  For what values of $\phi$ is this a causal process?

b.  The following sample moments were computed after observing $X_1, \ldots, X_{200}$:

$\hat\gamma(0) = 6.06$,  $\hat\rho(1) = 0.687$.

Find estimates of $\phi$ and $\sigma^2$ by solving the Yule-Walker equations. (If you find more than one solution, choose the one that is causal.)

5.4  Two hundred observations of a time series, $X_1, \ldots, X_{200}$, gave the following sample statistics:

sample mean:  $\bar x_{200} = 3.82$;

sample variance:  $\hat\gamma(0) = 1.15$;

sample ACF:  $\hat\rho(1) = 0.427$;  $\hat\rho(2) = 0.475$;  $\hat\rho(3) = 0.169$.

a.  Based on these sample statistics, is it reasonable to suppose that $\{X_t - \mu\}$ is white noise?

b.  Assuming that $\{X_t - \mu\}$ can be modeled as the AR(2) process

$$X_t - \mu - \phi_1(X_{t-1} - \mu) - \phi_2(X_{t-2} - \mu) = Z_t,$$

where $\{Z_t\} \sim \mathrm{IID}(0, \sigma^2)$, find estimates of $\mu$, $\phi_1$, $\phi_2$, and $\sigma^2$.

c.  Would you conclude that $\mu = 0$?

d.  Construct 95% confidence intervals for $\phi_1$ and $\phi_2$.

e.  Assuming that the data were generated from an AR(2) model, derive estimates of the PACF for all lags $h \ge 1$.

5.5  Use the program ITSM to simulate and file 20 realizations of length 200 of the Gaussian MA(1) process

$$X_t = Z_t + \theta Z_{t-1}, \quad \{Z_t\} \sim \mathrm{WN}(0, 1),$$

with $\theta = 0.6$.

a.  For each series find the moment estimate of $\theta$ as defined in Example 5.1.2.

b.  For each series use the innovations algorithm in the ITSM option Model>Estimation>Preliminary to find an estimate of $\theta$. (Use the default value of the parameter $m$.) As soon as you have found this preliminary estimate for a particular series, select Model>Estimation>Max likelihood to find the maximum likelihood estimate of $\theta$ for the series.

c.  Compute the sample means and sample variances of your three sets of estimates.

d.  Use the asymptotic formulae given at the end of Section 5.1.1 (with $n = 200$) to compute the variances of the moment, innovation, and maximum likelihood estimators of $\theta$. Compare with the corresponding sample variances found in (c).

e.  What do the results of (c) suggest concerning the relative merits of the three estimators?

5.6  Establish the recursions (5.1.19) and (5.1.20) for the forward and backward prediction errors $u_i(t)$ and $v_i(t)$ in Burg’s algorithm.

5.7  Derive the recursions for the Burg estimates $\phi_{ii}^{(B)}$ and $\sigma_i^{(B)2}$.

5.8  From  the  innovation  form  of  the  likelihood  (5.2.9)  derive  the  equations  (5.2.10), 
(5.2.11),  and  (5.2.12)  for  the  maximum  likelihood  estimators  of  the  parameters 
of  an  ARMA  process. 

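5.9  Use equation (5.2.9) to show that for $n > p$, the likelihood of the observations $\{X_1, \ldots, X_n\}$ of the causal AR($p$) process defined by

$$X_t = \phi_1 X_{t-1} + \cdots + \phi_p X_{t-p} + Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2),$$

is

$$L(\phi, \sigma^2) = (2\pi\sigma^2)^{-n/2} (\det G_p)^{-1/2} \exp\!\left\{ -\frac{1}{2\sigma^2} \left[ \mathbf{X}_p' G_p^{-1} \mathbf{X}_p + \sum_{t=p+1}^{n} \big(X_t - \phi_1 X_{t-1} - \cdots - \phi_p X_{t-p}\big)^2 \right] \right\},$$

where $\mathbf{X}_p = (X_1, \ldots, X_p)'$ and $G_p = \sigma^{-2}\Gamma_p = \sigma^{-2}E(\mathbf{X}_p\mathbf{X}_p')$.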

5.10  Use the result of Problem 5.9 to derive a pair of linear equations for the least squares estimates of $\phi_1$ and $\phi_2$ for a causal AR(2) process (with mean zero). Compare your equations with those for the Yule-Walker estimates. (Assume that the mean is known to be zero in writing down the latter equations, so that the sample autocovariances are $\hat\gamma(h) = \frac{1}{n}\sum_{t=1}^{n-h} X_{t+h}X_t$ for $h \ge 0$.)

5.11  Given two observations $x_1$ and $x_2$ from the causal AR(1) process satisfying

$$X_t = \phi X_{t-1} + Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2),$$

and assuming that $|x_1| \ne |x_2|$, find the maximum likelihood estimates of $\phi$ and $\sigma^2$.

5.12  Derive a cubic equation for the maximum likelihood estimate of the coefficient $\phi$ of a causal AR(1) process based on the observations $X_1, \ldots, X_n$.

5.13  Use the result of Problem A.7 and the approximate large-sample normal distribution of the maximum likelihood estimator $\hat\phi_p$ to establish the approximation (5.5.1).


Nonstationary  and  Seasonal 
Time  Series  Models 


6.1  ARIMA  Models  for  Nonstationary  Time  Series 

6.2  Identification  Techniques 

6.3  Unit  Roots  in  Time  Series  Models 

6.4  Forecasting  ARIMA  Models 

6.5  Seasonal  ARIMA  Models 

6.6  Regression  with  ARMA  Errors 


In this chapter we shall examine the problem of finding an appropriate model for a given set of observations $\{x_1, \ldots, x_n\}$ that are not necessarily generated by a stationary time series. If the data (a) exhibit no apparent deviations from stationarity and (b) have a rapidly decreasing autocovariance function, we attempt to fit an ARMA model to the mean-corrected data using the techniques developed in Chapter 5. Otherwise, we look first for a transformation of the data that generates a new series with the properties (a) and (b). This can frequently be achieved by differencing, leading us to consider the class of ARIMA (autoregressive integrated moving-average) models, defined in Section 6.1. We have in fact already encountered ARIMA processes. The model fitted in Example 5.1.1 to the Dow Jones Utilities Index was obtained by fitting an AR model to the differenced data, thereby effectively fitting an ARIMA model to the original series. In Section 6.1 we shall give a more systematic account of such models.

In Section 6.2 we discuss the problem of finding an appropriate transformation for the data and identifying a satisfactory ARMA($p, q$) model for the transformed data. The latter can be handled using the techniques developed in Chapter 5. The sample ACF and PACF and the preliminary estimators $\hat\phi_m$ and $\hat\theta_m$ of Section 5.1 can provide useful guidance in this choice. However, our prime criterion for model selection will be the AICC statistic discussed in Section 5.5.2. To apply this criterion we compute maximum likelihood estimators of $\phi$, $\theta$, and $\sigma^2$ for a variety of competing $p$ and $q$ values and choose the fitted model with smallest AICC value. Other techniques, in particular those that use the R and S arrays of Gray et al. (1978), are discussed in the survey of model identification by de Gooijer et al. (1985). If the fitted model is
satisfactory, the residuals (see Section 5.3) should resemble white noise. Tests for this
were  described  in  Section  5.3  and  should  be  applied  to  the  minimum  AICC  model 
to  make  sure  that  the  residuals  are  consistent  with  their  expected  behavior  under  the 
model.  If  they  are  not,  then  competing  models  (models  with  AICC  value  close  to  the 
minimum)  should  be  checked  until  we  find  one  that  passes  the  goodness  of  fit  tests.  In 
some  cases  a  small  difference  in  AICC  value  (say  less  than  2)  between  two  satisfactory 
models  may  be  ignored  in  the  interest  of  model  simplicity.  In  Section  6.3  we  consider 
the  problem  of  testing  for  a  unit  root  of  either  the  autoregressive  or  moving-average 
polynomial.  An  autoregressive  unit  root  suggests  that  the  data  require  differencing,  and 
a  moving-average  unit  root  suggests  that  they  have  been  overdifferenced.  Section  6.4 
considers  the  prediction  of  ARIMA  processes,  which  can  be  carried  out  using  an 
extension  of  the  techniques  developed  for  ARMA  processes  in  Sections  3.3  and  5.4. 
In  Section  6.5  we  examine  the  fitting  and  prediction  of  seasonal  ARIMA  (SARIMA) 
models,  whose  analysis,  except  for  certain  aspects  of  model  identification,  is  quite 
analogous  to  that  of  ARIMA  processes.  Finally,  we  consider  the  problem  of  regression, 
allowing  for  dependence  between  successive  residuals  from  the  regression.  Such 
models  are  known  as  regression  models  with  time  series  residuals  and  often  occur 
in  practice  as  natural  representations  for  data  containing  both  trend  and  serially 
dependent errors.

6.1  ARIMA Models for Nonstationary Time Series

We have already discussed the importance of the class of ARMA models for representing stationary series. A generalization of this class, which incorporates a wide range of nonstationary series, is provided by the ARIMA processes, i.e., processes that reduce to ARMA processes when differenced finitely many times.


Definition 6.1.1

If $d$ is a nonnegative integer, then $\{X_t\}$ is an ARIMA($p, d, q$) process if $Y_t := (1 - B)^d X_t$ is a causal ARMA($p, q$) process.

This definition means that $\{X_t\}$ satisfies a difference equation of the form

$$\phi^*(B)X_t \equiv \phi(B)(1 - B)^d X_t = \theta(B)Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2), \qquad (6.1.1)$$

where $\phi(z)$ and $\theta(z)$ are polynomials of degrees $p$ and $q$, respectively, and $\phi(z) \ne 0$ for $|z| \le 1$. The polynomial $\phi^*(z)$ has a zero of order $d$ at $z = 1$. The process $\{X_t\}$ is stationary if and only if $d = 0$, in which case it reduces to an ARMA($p, q$) process.

Notice that if $d \ge 1$, we can add an arbitrary polynomial trend of degree $(d - 1)$ to $\{X_t\}$ without violating the difference equation (6.1.1). ARIMA models are therefore useful for representing data with trend (see Sections 1.5 and 6.2). It should be noted, however, that ARIMA processes can also be appropriate for modeling series with no trend. Except when $d = 0$, the mean of $\{X_t\}$ is not determined by equation (6.1.1), and it can in particular be zero (as in Example 1.3.3). Since for $d \ge 1$, equation (6.1.1) determines the second-order properties of $\{(1 - B)^d X_t\}$ but not those of $\{X_t\}$ (Problem 6.1), estimation of $\phi$, $\theta$, and $\sigma^2$ will be based on the observed differences $(1 - B)^d X_t$. Additional assumptions are needed for prediction (see Section 6.4).

Example 6.1.1  $\{X_t\}$ is an ARIMA(1,1,0) process if for some $\phi \in (-1, 1)$,

$$(1 - \phi B)(1 - B)X_t = Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2).$$

We can then write

$$X_t = X_0 + \sum_{j=1}^{t} Y_j, \quad t \ge 1,$$

where

$$Y_t = (1 - B)X_t = \sum_{j=0}^{\infty} \phi^j Z_{t-j}.$$

A realization of $\{X_1, \ldots, X_{200}\}$ with $X_0 = 0$, $\phi = 0.8$, and $\sigma^2 = 1$ is shown in Figure 6-1, with the corresponding sample autocorrelation and partial autocorrelation functions in Figures 6-2 and 6-3, respectively.

Figure 6-1  200 observations of the ARIMA(1,1,0) series $X_t$ of Example 6.1.1

Figure 6-2  The sample ACF of the data in Figure 6-1

□ 
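A realization of this kind can be generated outside ITSM along the following lines. The Python sketch below (using NumPy) simulates the AR(1) differences $Y_t$ and cumulates them to obtain $X_t = X_0 + \sum_{j \le t} Y_j$. The function name, the burn-in length, and the seed are assumptions made for illustration, so the output will not reproduce the particular series shown in Figure 6-1.

```python
import numpy as np


def simulate_arima_110(n, phi, sigma=1.0, x0=0.0, burn_in=500, seed=0):
    """Simulate X_1..X_n from (1 - phi B)(1 - B) X_t = Z_t with Gaussian noise.

    Returns (x, y) where y holds the differences Y_t = X_t - X_{t-1}.
    """
    rng = np.random.default_rng(seed)
    z = rng.normal(0.0, sigma, size=n + burn_in)
    y = np.zeros(n + burn_in)
    for t in range(1, n + burn_in):
        y[t] = phi * y[t - 1] + z[t]      # AR(1) recursion for the differences
    y = y[burn_in:]                        # discard the start-up transient
    x = x0 + np.cumsum(y)                  # X_t = X_0 + Y_1 + ... + Y_t
    return x, y


x, y = simulate_arima_110(200, 0.8)
```

Differencing the simulated series with `np.diff(x)` recovers `y[1:]`, which is the series one would then model as an AR(1) process.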

A distinctive feature of the data that suggests the appropriateness of an ARIMA model is the slowly decaying positive sample autocorrelation function in Figure 6-2. If, therefore, we were given only the data and wished to find an appropriate model, it would be natural to apply the operator $\nabla = 1 - B$ repeatedly in the hope that for some $j$, $\{\nabla^j X_t\}$ will have a rapidly decaying sample autocorrelation function compatible with that of an ARMA process with no zeros of the autoregressive polynomial near the unit circle. For this particular time series, one application of the operator $\nabla$ produces the realization shown in Figure 6-4, whose sample ACF and PACF (Figures 6-5 and 6-6) suggest an AR(1) [or possibly AR(2)] model for $\{\nabla X_t\}$. The maximum likelihood estimates of $\phi$ and $\sigma^2$ obtained from ITSM under the assumption that $E(\nabla X_t) = 0$ (found by not subtracting the mean after differencing the data) are 0.808 and 0.978, respectively, giving the model

$$(1 - 0.808B)(1 - B)X_t = Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, 0.978), \qquad (6.1.2)$$

which bears a close resemblance to the true underlying process,

$$(1 - 0.8B)(1 - B)X_t = Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, 1). \qquad (6.1.3)$$

Figure 6-3  The sample PACF of the data in Figure 6-1

Figure 6-4  199 observations of the series $Y_t = \nabla X_t$ with $\{X_t\}$ as in Figure 6-1

Figure 6-5  The sample ACF of the series $\{Y_t\}$ in Figure 6-4

Figure 6-6  The sample PACF of the series $\{Y_t\}$ in Figure 6-4

Instead of differencing the series in Figure 6-1 we could proceed more directly by attempting to fit an AR(2) process as suggested by the sample PACF of the original series in Figure 6-3. Maximum likelihood estimation, carried out using ITSM after fitting a preliminary model with Burg’s algorithm and assuming that $EX_t = 0$, gives the model

$$(1 - 1.808B + 0.811B^2)X_t = (1 - 0.825B)(1 - 0.983B)X_t = Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, 0.970), \qquad (6.1.4)$$

which, although stationary, has coefficients closely resembling those of the true nonstationary process (6.1.3). (To obtain the model (6.1.4), two optimizations were carried out using the Model>Estimation>Max likelihood option of ITSM, the first with the default settings and the second after setting the accuracy parameter to 0.00001.) From a sample of finite length it will be extremely difficult to distinguish between a nonstationary process such as (6.1.3), for which $\phi^*(1) = 0$, and a process such as (6.1.4), which has very similar coefficients but for which $\phi^*$ has all of its zeros outside the unit circle. In either case, however, if it is possible by differencing to generate a series with rapidly decaying sample ACF, then the differenced data set can be fitted by a low-order ARMA process whose autoregressive polynomial $\phi^*$ has zeros that are comfortably outside the unit circle. This means that the fitted parameters will be well away from the boundary of the allowable parameter set. This is desirable for numerical computation of parameter estimates and can be quite critical for some methods of estimation. For example, if we apply the Yule-Walker equations to fit an AR(2) model to the data in Figure 6-1, we obtain the model

$$(1 - 1.282B + 0.290B^2)X_t = Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, 6.435), \qquad (6.1.5)$$

which bears little resemblance to either the maximum likelihood model (6.1.4) or the true model (6.1.3). In this case the matrix $\hat R_2$ appearing in (5.1.7) is nearly singular.

Figure 6-7  200 observations of the AR(2) process defined by (6.1.6) with $r = 1.005$ and $\omega = \pi/3$
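How close the autoregressive zeros of these models lie to the unit circle can be checked numerically. The sketch below uses NumPy's `np.roots` to find the zeros of the polynomials $1 - 1.808z + 0.811z^2$ (maximum likelihood fit), $1 - 1.8z + 0.8z^2$ (true process), and $1 - 1.282z + 0.290z^2$ (Yule-Walker fit); the helper name `ar_zeros` is an illustrative choice.

```python
import numpy as np


def ar_zeros(coeffs):
    """Zeros of c0 + c1 z + ... + cp z^p, given coeffs = [c0, c1, ..., cp]."""
    # np.roots expects the coefficient of the highest power first.
    return np.roots(coeffs[::-1])


ml_zeros = ar_zeros([1.0, -1.808, 0.811])     # fitted AR(2) model, ML
true_zeros = ar_zeros([1.0, -1.8, 0.8])       # (1 - 0.8z)(1 - z)
yw_zeros = ar_zeros([1.0, -1.282, 0.290])     # fitted AR(2) model, Yule-Walker

# Distance of the smallest zero from the unit circle in each case.
ml_margin = np.min(np.abs(ml_zeros)) - 1.0
true_margin = np.min(np.abs(true_zeros)) - 1.0
yw_margin = np.min(np.abs(yw_zeros)) - 1.0
```

The true process has a zero exactly at 1 (margin zero up to rounding), while the maximum likelihood fit has its smallest zero near $1/0.983$, only slightly outside the unit circle.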

An obvious limitation in fitting an ARIMA($p, d, q$) process $\{X_t\}$ to data is that $\{X_t\}$ is permitted to be nonstationary only in a very special way, i.e., by allowing the polynomial $\phi^*(B)$ in the representation $\phi^*(B)X_t = Z_t$ to have a zero of multiplicity $d$ at the point 1 on the unit circle. Such models are appropriate when the sample ACF is a slowly decaying positive function as in Figure 6-2, since sample autocorrelation functions of this form are associated with models $\phi^*(B)X_t = \theta(B)Z_t$ in which $\phi^*$ has a zero either at or close to 1.

Sample autocorrelations with slowly decaying oscillatory behavior as in Figure 6-8 are associated with models $\phi^*(B)X_t = \theta(B)Z_t$ in which $\phi^*$ has a zero close to $e^{i\omega}$ for some $\omega \in (-\pi, \pi]$ other than 0. Figure 6-8 is the sample ACF of the series of 200 observations in Figure 6-7, obtained from ITSM by simulating the AR(2) process

$$X_t - (2r^{-1}\cos\omega)X_{t-1} + r^{-2}X_{t-2} = Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, 1), \qquad (6.1.6)$$

with $r = 1.005$ and $\omega = \pi/3$, i.e.,

$$X_t - 0.9950X_{t-1} + 0.9901X_{t-2} = Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, 1).$$

The autocorrelation function of the model (6.1.6) can be derived by noting that

$$1 - (2r^{-1}\cos\omega)B + r^{-2}B^2 = \big(1 - r^{-1}e^{i\omega}B\big)\big(1 - r^{-1}e^{-i\omega}B\big) \qquad (6.1.7)$$

and using (3.2.12). This gives

$$\rho(h) = r^{-h}\,\frac{\sin(h\omega + \psi)}{\sin\psi}, \quad h \ge 0, \qquad (6.1.8)$$

where

$$\tan\psi = \frac{r^2 + 1}{r^2 - 1}\tan\omega. \qquad (6.1.9)$$

It is clear from these equations that

$$\rho(h) \to \cos(h\omega) \quad \text{as } r \downarrow 1. \qquad (6.1.10)$$

Figure 6-8  The sample ACF of the data in Figure 6-7

With $r = 1.005$ and $\omega = \pi/3$ as in the model generating Figure 6-7, the model ACF (6.1.8) is a damped sine wave with damping ratio 1/1.005 and period 6. These properties are reflected in the sample ACF shown in Figure 6-8. For values of $r$ closer to 1, the damping will be even slower as the model ACF approaches its limiting form (6.1.10).
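The damped sine wave can be evaluated directly. The Python sketch below computes $\rho(h) = r^{-h}\sin(h\omega + \psi)/\sin\psi$ with $\tan\psi = \frac{r^2+1}{r^2-1}\tan\omega$, and it can be compared with the usual AR(2) recursion $\rho(h) = \phi_1\rho(h-1) + \phi_2\rho(h-2)$ with $\phi_1 = 2r^{-1}\cos\omega$ and $\phi_2 = -r^{-2}$. The function name is hypothetical, and the code assumes $r > 1$ and $\omega$ not a multiple of $\pi/2$, so that $\tan\omega$ is finite and $\sin\psi \ne 0$.

```python
import numpy as np


def damped_acf(h, r, omega):
    """Model ACF r^(-h) sin(h*omega + psi) / sin(psi) of the AR(2) process."""
    # arctan picks psi in (-pi/2, pi/2); shifting psi by pi leaves the ratio
    # sin(h*omega + psi) / sin(psi) unchanged, so this choice is harmless.
    psi = np.arctan((r**2 + 1.0) / (r**2 - 1.0) * np.tan(omega))
    h = np.asarray(h, dtype=float)
    return r ** (-h) * np.sin(h * omega + psi) / np.sin(psi)


# Values for the model generating Figure 6-7 at lags 0, 1, ..., 12.
rho_model = damped_acf(np.arange(13), 1.005, np.pi / 3)
```

At lag 6, one full period, the value is $r^{-6}$, which shows how slowly the oscillation is damped when $r$ is close to 1.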

If we were simply given the data shown in Figure 6-7, with no indication of the model from which it was generated, the slowly damped sinusoidal sample ACF with period 6 would suggest trying to make the sample ACF decay more rapidly by applying the operator (6.1.7) with $r = 1$ and $\omega = \pi/3$, i.e., $(1 - B + B^2)$. If it happens, as in this case, that the period $2\pi/\omega$ is close to some integer $s$ (in this case 6), then the operator $1 - B^s$ can also be applied to produce a series with more rapidly decaying autocorrelation function (see also Section 6.5). Figures 6-9 and 6-10 show the sample autocorrelation functions obtained after applying the operators $1 - B + B^2$ and $1 - B^6$, respectively, to the data shown in Figure 6-7. For either one of these two differenced series, it is then not difficult to fit an ARMA model $\phi(B)X_t = \theta(B)Z_t$ for which the zeros of $\phi$ are well outside the unit circle. Techniques for identifying and determining such ARMA models have already been introduced in Chapter 5. For convenience we shall collect these together in the following sections with a number of illustrative examples.

Figure 6-9  The sample ACF of $(1 - B + B^2)X_t$ with $\{X_t\}$ as in Figure 6-7

Figure 6-10  The sample ACF of $(1 - B^6)X_t$ with $\{X_t\}$ as in Figure 6-7

6.2  Identification  Techniques 

(a)  Preliminary Transformations.  The estimation methods of Chapter 5 enable us to
find, for given values of $p$ and $q$, an ARMA($p, q$) model to fit a given series of data.
For  this  procedure  to  be  meaningful  it  must  be  at  least  plausible  that  the  data  are  in 
fact  a  realization  of  an  ARMA  process  and  in  particular  a  realization  of  a  stationary 
process.  If  the  data  display  characteristics  suggesting  nonstationarity  (e.g.,  trend  and 
seasonality),  then  it  may  be  necessary  to  make  a  transformation  so  as  to  produce  a  new 
series  that  is  more  compatible  with  the  assumption  of  stationarity. 

Deviations from stationarity may be suggested by the graph of the series itself or
by the sample autocorrelation function or both.

Figure 6-11  The Australian red wine data after taking natural logarithms and removing a seasonal component of period 12 and a linear trend


Inspection of the graph of the series will occasionally reveal a strong dependence of variability on the level of the series, in which case the data should first be transformed to reduce or eliminate this dependence. For example, Figure 1-1 shows the Australian monthly red wine sales from January 1980 through October 1991, and Figure 1-17 shows how the increasing variability with sales level is reduced by taking natural logarithms of the original series. The logarithmic transformation $V_t = \ln U_t$ used here is in fact appropriate whenever $\{U_t\}$ is a series whose standard deviation increases linearly with the mean. For a systematic account of a general class of variance-stabilizing transformations, we refer the reader to Box and Cox (1964). The defining equation for the general Box-Cox transformation $f_\lambda$ is

$$f_\lambda(U_t) = \begin{cases} \lambda^{-1}\big(U_t^\lambda - 1\big), & U_t \ge 0,\ \lambda > 0, \\ \ln U_t, & U_t > 0,\ \lambda = 0, \end{cases}$$

and the program ITSM provides the option (Transform>Box-Cox) of applying $f_\lambda$ (with $0 \le \lambda \le 1.5$) prior to the elimination of trend and/or seasonality from the data. In practice, if a Box-Cox transformation is necessary, it is often the case that either $f_0$ or $f_{0.5}$ is adequate.
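For readers working outside ITSM, the transformation is easily coded. The Python sketch below implements the Box-Cox family as $\lambda^{-1}(u^\lambda - 1)$ for $\lambda > 0$ and $\ln u$ for $\lambda = 0$; the function name `box_cox` and the error handling are illustrative choices, not part of ITSM.

```python
import numpy as np


def box_cox(u, lam):
    """Box-Cox transform: (u**lam - 1)/lam for lam > 0, log(u) for lam == 0."""
    u = np.asarray(u, dtype=float)
    if lam < 0:
        raise ValueError("this sketch only covers lam >= 0")
    if lam == 0:
        if np.any(u <= 0):
            raise ValueError("the logarithm requires strictly positive data")
        return np.log(u)
    if np.any(u < 0):
        raise ValueError("the power transform requires nonnegative data")
    return (u**lam - 1.0) / lam


log_example = box_cox([1.0, np.e], 0)      # lam = 0: natural logarithm
sqrt_example = box_cox([4.0, 9.0], 0.5)    # lam = 0.5: 2 * (sqrt(u) - 1)
```

The case $\lambda = 0$ is the limit of the power form as $\lambda \to 0$, which is why the logarithm appears as a member of the same family.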

Trend  and  seasonality  are  usually  detected  by  inspecting  the  graph  of  the  (possibly 
transformed)  series.  However,  they  are  also  characterized  by  autocorrelation  functions 
that  are  slowly  decaying  and  nearly  periodic,  respectively.  The  elimination  of  trend 
and  seasonality  was  discussed  in  Section  1.5,  where  we  described  two  methods: 

(i)  “classical  decomposition”  of  the  series  into  a  trend  component,  a  seasonal 
component,  and  a  random  residual  component,  and 

(ii)  differencing. 

The program ITSM (in the Transform option) offers a choice between these techniques.
The results of applying methods (i) and (ii) to the transformed red wine data
$V_t = \ln U_t$ in Figure 1-17 are shown in Figures 6-11 and 6-12, respectively. Figure 6-11
was obtained from ITSM by estimating and removing from $\{V_t\}$ a linear trend
component and a seasonal component with period 12. Figure 6-12 was obtained by
applying the operator $(1 - B^{12})$ to $\{V_t\}$. Neither of the two resulting series displays
any apparent deviations from stationarity, nor do their sample autocorrelation functions.
The sample ACF and PACF of $\{(1 - B^{12})V_t\}$ are shown in Figures 6-13 and
6-14, respectively.

Figure 6-12. The Australian red wine data after taking natural logarithms and differencing at lag 12.
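
The effect of taking logarithms and then differencing at lag 12 can be sketched in a few lines of Python. The series below is synthetic (an assumption made purely for illustration, not the red wine data): exponential growth multiplied by a fixed seasonal pattern of period 12, with no noise.

```python
import math

def difference(series, lag=1):
    """Apply (1 - B^lag): return series[t] - series[t - lag] for t >= lag."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

# Synthetic monthly series: trend and seasonality both act multiplicatively.
seasonal = [0.3 * math.sin(2 * math.pi * m / 12) for m in range(12)]
u = [math.exp(0.01 * t + seasonal[t % 12]) for t in range(48)]

v = [math.log(x) for x in u]      # V_t = ln U_t
d12 = difference(v, lag=12)       # (1 - B^12) V_t
```

For this noise-free construction every lag-12 difference equals 0.12, the yearly growth of the log series, so both the linear trend and the seasonal component have been removed by the single operator $(1 - B^{12})$.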

After the elimination of trend and seasonality, it is still possible that the sample
autocorrelation function may appear to be that of a nonstationary (or nearly nonstationary)
process, in which case further differencing may be carried out.

In  Section  1.5  we  also  mentioned  a  third  possible  approach: 

(iii)  fitting  a  sum  of  harmonics  and  a  polynomial  trend  to  generate  a  noise  sequence 
that  consists  of  the  residuals  from  the  regression. 

In  Section  6.6  we  discuss  the  modifications  to  classical  least  squares  regression 
analysis  that  allow  for  dependence  among  the  residuals  from  the  regression.  These 
modifications  are  implemented  in  the  ITSM  option  Regression>Estimation> 
Generalized  LS. 

(b) Identification and Estimation. Let $\{X_t\}$ be the mean-corrected transformed
series found as described in (a). The problem now is to find the most satisfactory
ARMA($p$, $q$) model to represent $\{X_t\}$. If $p$ and $q$ were known in advance, this would
be a straightforward application of the estimation techniques described in Chapter 5.
However, this is usually not the case, so it becomes necessary also to identify
appropriate values for $p$ and $q$.

It  might  appear  at  first  sight  that  the  higher  the  values  chosen  for  p  and  q ,  the 
better  the  resulting  fitted  model  will  be.  However,  as  pointed  out  in  Section  5.5, 
estimation  of  too  large  a  number  of  parameters  introduces  estimation  errors  that 
adversely  affect  the  use  of  the  fitted  model  for  prediction  as  illustrated  in  Section  5.4. 
We  therefore  minimize  one  of  the  model  selection  criteria  discussed  in  Section  5.5  in 
order  to  choose  the  values  of  p  and  q.  Each  of  these  criteria  includes  a  penalty  term 
to  discourage  the  fitting  of  too  many  parameters.  We  shall  base  our  choice  of  p  and 
q  primarily  on  the  minimization  of  the  AICC  statistic,  defined  as 


$$\mathrm{AICC}(\boldsymbol{\phi}, \boldsymbol{\theta}) = -2 \ln L\left(\boldsymbol{\phi}, \boldsymbol{\theta}, S(\boldsymbol{\phi}, \boldsymbol{\theta})/n\right) + 2(p + q + 1)n/(n - p - q - 2), \tag{6.2.1}$$

where $L(\boldsymbol{\phi}, \boldsymbol{\theta}, \sigma^2)$ is the likelihood of the data under the Gaussian ARMA model with
parameters $(\boldsymbol{\phi}, \boldsymbol{\theta}, \sigma^2)$, and $S(\boldsymbol{\phi}, \boldsymbol{\theta})$ is the residual sum of squares defined in (5.2.11).
Once a model has been found that minimizes the AICC value, it is then necessary to
check the model for goodness of fit (essentially by checking that the residuals are like
white noise) as discussed in Section 5.3.

Figure 6-14. The sample PACF of the data in Figure 6-12.
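
The arithmetic of (6.2.1) is easy to reproduce once $-2 \ln L$ has been obtained from a fitting routine. The sketch below assumes that value is supplied by the caller; the function names and the three candidate values of $-2 \ln L$ are hypothetical and are not output from ITSM.

```python
def aicc(neg2_log_lik, p, q, n):
    """AICC of (6.2.1), given -2 ln L evaluated at the fitted parameters."""
    if n - p - q - 2 <= 0:
        raise ValueError("need n > p + q + 2")
    return neg2_log_lik + 2.0 * (p + q + 1) * n / (n - p - q - 2)

def penalty(p, q, n):
    """Penalty term alone: 2(p + q + 1) n / (n - p - q - 2)."""
    return aicc(0.0, p, q, n)

# Hypothetical -2 ln L values for three candidate orders (p, q) with n = 130.
candidates = {(1, 0): -170.0, (12, 0): -186.0, (1, 12): -199.0}
scores = {pq: aicc(val, pq[0], pq[1], 130) for pq, val in candidates.items()}
best = min(scores, key=scores.get)   # order with the smallest AICC
```

The penalty always exceeds $2(p + q + 1)$ and grows rapidly as $p + q$ approaches $n$, which is how the criterion discourages the fitting of too many parameters.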

For any fixed values of $p$ and $q$, the maximum likelihood estimates of $\boldsymbol{\phi}$ and
$\boldsymbol{\theta}$ are the values that minimize the AICC. Hence, the minimum AICC model (over
any given range of $p$ and $q$ values) can be found by computing the maximum
likelihood estimators for each fixed $p$ and $q$ and choosing from these the maximum
likelihood model with the smallest value of AICC. This can be done with the program
ITSM by using the option Model>Estimation>Autofit. When this option
is selected and upper and lower bounds for $p$ and $q$ are specified, the program
fits maximum likelihood models for each pair $(p, q)$ in the range specified and
selects the model with smallest AICC value. If some of the coefficient estimates are
small compared with their estimated standard deviations, maximum likelihood subset
models (with those coefficients set to zero) can also be explored.

The  steps  in  model  identification  and  estimation  can  be  summarized  as  follows: 

•  After transforming the data (if necessary) to make the fitting of an ARMA($p$, $q$)
model reasonable, examine the sample ACF and PACF to get some idea of potential
$p$ and $q$ values. Preliminary estimation using the ITSM option
Model>Estimation>Preliminary is also useful in this respect. Burg's algorithm with
AICC minimization rapidly fits autoregressions of all orders up to 27 and selects the
one with minimum AICC value. For preliminary estimation of models with $q > 0$,
each pair $(p, q)$ must be considered separately.

•  Select the option Model>Estimation>Autofit of ITSM. Specify the
required limits for $p$ and $q$, and the program will then use maximum likelihood
estimation to find the minimum AICC model with $p$ and $q$ in the range specified.

•  Examination of the fitted coefficients and their standard errors may suggest that
some of them can be set to zero. If this is the case, then a subset model can be
fitted by clicking on the button Constrain optimization in the Maximum
Likelihood Estimation dialog box and setting the selected coefficients to
zero. Optimization will then give the maximum likelihood model with the chosen
coefficients constrained to be zero. The constrained model is assessed by comparing
its AICC value with those of the other candidate models.

•  Check the candidate model(s) for goodness of fit as described in Section 5.3.
These tests can be performed by selecting the option Statistics>Residual
Analysis.

Example  6.2.1  The  Australian  Red  Wine  Data 

Let $\{X_1, \ldots, X_{130}\}$ denote the series obtained from the red wine data of Example 1.1.1
after taking natural logarithms, differencing at lag 12, and subtracting the mean
(0.0681) of the differences. The data prior to mean correction are shown in Figure 6-12.
The sample PACF of $\{X_t\}$, shown in Figure 6-14, suggests that an AR(12) model
might be appropriate for this series. To explore this possibility we use the ITSM
option Model>Estimation>Preliminary with Burg's algorithm and AICC
minimization. As anticipated, the fitted Burg models do indeed have minimum AICC
when $p = 12$. The fitted model is

$$\left(1 - 0.245B - 0.069B^2 - 0.012B^3 - 0.021B^4 - 0.200B^5 + 0.025B^6 + 0.004B^7 - 0.133B^8 + 0.010B^9 - 0.095B^{10} + 0.118B^{11} + 0.384B^{12}\right)X_t = Z_t,$$

with $\{Z_t\} \sim \mathrm{WN}(0, 0.0135)$ and AICC value $-158.77$. Selecting the option
Model>Estimation>Max likelihood then gives the maximum likelihood AR(12)
model, which is very similar to the Burg model and has AICC value $-158.87$.
Inspection of the standard errors of the coefficient estimators suggests the possibility
of setting those at lags 2, 3, 4, 6, 7, 9, 10, and 11 equal to zero. If we do this by clicking
on the Constrain optimization button in the Maximum Likelihood
Estimation dialog box and then reoptimize, we obtain the model

$$\left(1 - 0.270B - 0.224B^5 - 0.149B^8 + 0.099B^{11} + 0.353B^{12}\right)X_t = Z_t,$$

with $\{Z_t\} \sim \mathrm{WN}(0, 0.0138)$ and AICC value $-172.49$.

In order to check more general ARMA($p$, $q$) models, select the option
Model>Estimation>Autofit and specify the minimum and maximum values of $p$ and
$q$ to be zero and 15, respectively. (The sample ACF and PACF suggest that these limits
should be more than adequate to include the minimum AICC model.) In a few minutes
(depending on the speed of your computer) the program selects an ARMA(1,12) model
with AICC value $-172.74$, which is slightly better than the subset AR(12) model
just found. Inspection of the estimated standard deviations of the MA coefficients at
lags 1, 3, 4, 6, 7, 9, and 11 suggests setting them equal to zero and reestimating the
values of the remaining coefficients. If we do this by clicking on the Constrain
optimization button in the Maximum Likelihood Estimation dialog
box, setting the required coefficients to zero and then reoptimizing, we obtain the
model

$$(1 - 0.286B)X_t = \left(1 + 0.127B^2 + 0.183B^5 + 0.177B^8 + 0.181B^{10} - 0.554B^{12}\right)Z_t,$$

with $\{Z_t\} \sim \mathrm{WN}(0, 0.0120)$ and AICC value $-184.09$.

The subset ARMA(1,12) model easily passes all the goodness of fit tests
in the Statistics>Residual Analysis option. In view of this and its small
AICC value, we accept it as a plausible model for the transformed red wine series.

□ 


Example  6.2.2  The  Lake  Data 

Let $\{Y_t,\ t = 1, \ldots, 99\}$ denote the lake data of Example 1.3.5. We have seen already
in Example 5.2.5 that the ITSM option Model>Estimation>Autofit gives the
minimum-AICC model

$$X_t - 0.7446X_{t-1} = Z_t + 0.3213Z_{t-1}, \quad \{Z_t\} \sim \mathrm{WN}(0, 0.4750),$$

for the mean-corrected series $X_t = Y_t - 9.0041$. The corresponding AICC value is
212.77. Since the model passes all the goodness of fit tests, we accept it as a reasonable
model for the data.

□ 


6.3  Unit  Roots  in  Time  Series  Models 

The unit root problem in time series arises when either the autoregressive or moving-average
polynomial of an ARMA model has a root on or near the unit circle. A
unit root in either of these polynomials has important implications for modeling.
For  example,  a  root  near  1  of  the  autoregressive  polynomial  suggests  that  the  data 
should  be  differenced  before  fitting  an  ARMA  model,  whereas  a  root  near  1  of 
the  moving-average  polynomial  indicates  that  the  data  were  overdifferenced.  In  this 
section,  we  consider  inference  procedures  for  detecting  the  presence  of  a  unit  root  in 
the  autoregressive  and  moving-average  polynomials. 


6.3.1  Unit  Roots  in  Autoregressions 

In Section 6.1 we discussed the use of differencing to transform a nonstationary time
series with a slowly decaying sample ACF and values near 1 at small lags into one
with a rapidly decreasing sample ACF. The degree of differencing of a time series $\{X_t\}$
was largely determined by applying the difference operator repeatedly until the sample
ACF of $\{\nabla^d X_t\}$ decays quickly. The differenced time series could then be modeled by
a low-order ARMA($p$, $q$) process, and hence the resulting ARIMA($p$, $d$, $q$) model
for the original data has an autoregressive polynomial $\left(1 - \phi_1 z - \cdots - \phi_p z^p\right)(1 - z)^d$ [see
(6.1.1)] with $d$ roots on the unit circle. In this subsection we discuss a more systematic
approach to testing for the presence of a unit root of the autoregressive polynomial in
order to decide whether or not a time series should be differenced. This approach was
pioneered by Dickey and Fuller (1979).

Let $X_1, \ldots, X_n$ be observations from the AR(1) model

$$X_t - \mu = \phi_1(X_{t-1} - \mu) + Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2), \tag{6.3.1}$$

where $|\phi_1| < 1$ and $\mu = EX_t$. For large $n$, the maximum likelihood estimator $\hat\phi_1$ of $\phi_1$
is approximately $N\left(\phi_1, (1 - \phi_1^2)/n\right)$. For the unit root case, this normal approximation
is no longer applicable, even asymptotically, which precludes its use for testing the
unit root hypothesis $H_0: \phi_1 = 1$ vs. $H_1: \phi_1 < 1$. To construct a test of $H_0$, write the
model (6.3.1) as

$$\nabla X_t = X_t - X_{t-1} = \phi_0^* + \phi_1^* X_{t-1} + Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2), \tag{6.3.2}$$

where $\phi_0^* = \mu(1 - \phi_1)$ and $\phi_1^* = \phi_1 - 1$. Now let $\hat\phi_1^*$ be the ordinary least squares
(OLS) estimator of $\phi_1^*$ found by regressing $\nabla X_t$ on 1 and $X_{t-1}$. The estimated standard
error of $\hat\phi_1^*$ is

$$\widehat{\mathrm{SE}}\left(\hat\phi_1^*\right) = S\left(\sum_{t=2}^{n}\left(X_{t-1} - \bar X\right)^2\right)^{-1/2},$$

where $S^2 = \sum_{t=2}^{n}\left(\nabla X_t - \hat\phi_0^* - \hat\phi_1^* X_{t-1}\right)^2/(n - 3)$ and $\bar X$ is the sample mean of
$X_1, \ldots, X_{n-1}$. Dickey and Fuller derived the limit distribution as $n \to \infty$ of the $t$-ratio

$$\hat\tau_\mu := \hat\phi_1^*/\widehat{\mathrm{SE}}\left(\hat\phi_1^*\right) \tag{6.3.3}$$

under the unit root assumption $\phi_1^* = 0$, from which a test of the null hypothesis
$H_0: \phi_1 = 1$ can be constructed. The 0.01, 0.05, and 0.10 quantiles of the limit
distribution of $\hat\tau_\mu$ (see Table 8.5.2 of Fuller 1976) are $-3.43$, $-2.86$, and $-2.57$,
respectively. The augmented Dickey-Fuller test then rejects the null hypothesis of a
unit root, at say, level 0.05 if $\hat\tau_\mu < -2.86$. Notice that the cutoff value for this test
statistic is much smaller than the standard cutoff value of $-1.645$ obtained from the
normal approximation to the $t$-distribution, so that the unit root hypothesis is less likely
to be rejected using the correct limit distribution.
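
A minimal sketch of the $t$-ratio (6.3.3), written directly from the formulas above, is given below. It is not the ITSM routine; the function name and the two simulated series (Gaussian white noise and its partial sums, generated with a fixed seed) are assumptions made for illustration.

```python
import math
import random

def dickey_fuller_t_ratio(x):
    """t-ratio (6.3.3) from the OLS regression of del X_t on 1 and X_{t-1}."""
    n = len(x)
    lagged = x[:-1]                                      # X_1, ..., X_{n-1}
    diffs = [x[t] - x[t - 1] for t in range(1, n)]       # del X_2, ..., del X_n
    xbar = sum(lagged) / (n - 1)
    dbar = sum(diffs) / (n - 1)
    sxx = sum((v - xbar) ** 2 for v in lagged)
    sxy = sum((v - xbar) * (d - dbar) for v, d in zip(lagged, diffs))
    slope = sxy / sxx                                    # estimate of phi_1^*
    intercept = dbar - slope * xbar                      # estimate of phi_0^*
    rss = sum((d - intercept - slope * v) ** 2 for v, d in zip(lagged, diffs))
    s2 = rss / (n - 3)                                   # S^2
    return slope / math.sqrt(s2 / sxx)                   # slope / SE

rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(300)]        # stationary series
walk, level = [], 0.0
for z in noise:
    level += z
    walk.append(level)                                   # series with a unit root

tau_noise = dickey_fuller_t_ratio(noise)
tau_walk = dickey_fuller_t_ratio(walk)
```

For the white noise series the fitted slope is close to $-1$, so `tau_noise` lies far below the cutoff $-2.86$ and the unit root hypothesis is rejected. The value `tau_walk` should be compared with the Dickey-Fuller cutoffs quoted above rather than with normal or $t$ quantiles.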

The above procedure can be extended to the case where $\{X_t\}$ follows the AR($p$)
model with mean $\mu$ given by

$$X_t - \mu = \phi_1(X_{t-1} - \mu) + \cdots + \phi_p(X_{t-p} - \mu) + Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2).$$

This model can be rewritten as (see Problem 6.2)

$$\nabla X_t = \phi_0^* + \phi_1^* X_{t-1} + \phi_2^* \nabla X_{t-1} + \cdots + \phi_p^* \nabla X_{t-p+1} + Z_t, \tag{6.3.4}$$

where $\phi_0^* = \mu\left(1 - \phi_1 - \cdots - \phi_p\right)$, $\phi_1^* = \sum_{i=1}^{p}\phi_i - 1$, and $\phi_j^* = -\sum_{i=j}^{p}\phi_i$, $j = 2, \ldots, p$.
If the autoregressive polynomial has a unit root at 1, then $0 = \phi(1) = -\phi_1^*$,
and the differenced series $\{\nabla X_t\}$ is an AR($p - 1$) process. Consequently, testing the
hypothesis of a unit root at 1 of the autoregressive polynomial is equivalent to testing
$\phi_1^* = 0$. As in the AR(1) example, $\phi_1^*$ can be estimated as the coefficient of $X_{t-1}$ in the
OLS regression of $\nabla X_t$ onto $1, X_{t-1}, \nabla X_{t-1}, \ldots, \nabla X_{t-p+1}$. For large $n$ the $t$-ratio

$$\hat\tau_\mu := \hat\phi_1^*/\widehat{\mathrm{SE}}\left(\hat\phi_1^*\right), \tag{6.3.5}$$

where $\widehat{\mathrm{SE}}\left(\hat\phi_1^*\right)$ is the estimated standard error of $\hat\phi_1^*$, has the same limit distribution as
the test statistic in (6.3.3). The augmented Dickey-Fuller test in this case is applied in
exactly the same manner as for the AR(1) case using the test statistic (6.3.5) and the
cutoff values given above.

Example 6.3.1

Consider testing the time series of Example 6.1.1 (see Figure 6-1) for the presence
of a unit root in the autoregressive operator. The sample PACF in Figure 6-3 suggests
fitting an AR(2) or possibly an AR(3) model to the data. Regressing $\nabla X_t$ on
$1, X_{t-1}, \nabla X_{t-1}, \nabla X_{t-2}$ for $t = 4, \ldots, 200$ using OLS gives

$$\nabla X_t = \underset{(0.1135)}{0.1503} - \underset{(0.0028)}{0.0041}X_{t-1} + \underset{(0.0707)}{0.9335}\nabla X_{t-1} - \underset{(0.0708)}{0.1548}\nabla X_{t-2} + Z_t,$$

where $\{Z_t\} \sim \mathrm{WN}(0, 0.9639)$ and the estimated standard errors are shown in parentheses
beneath the coefficients. The test statistic for testing the presence of a unit root is

$$\hat\tau_\mu = \frac{-0.0041}{0.0028} = -1.464.$$

Since $-1.464 > -2.57$, the unit root hypothesis is not rejected at level 0.10. In
contrast, if we had mistakenly used the $t$-distribution with 193 degrees of freedom
as an approximation to $\hat\tau_\mu$, then we would have rejected the unit root hypothesis at
the 0.10 level ($p$-value is 0.074). The $t$-ratios for the other coefficients, $\phi_0^*$, $\phi_2^*$, and
$\phi_3^*$, have an approximate $t$-distribution with 193 degrees of freedom. Based on these
$t$-ratios, the intercept should be 0, while the coefficient of $\nabla X_{t-2}$ is barely significant.
The evidence is much stronger in favor of a unit root if the analysis is repeated without
a mean term. The fitted model without a mean term is

$$\nabla X_t = -\underset{(0.0018)}{0.0012}X_{t-1} + \underset{(0.0707)}{0.9395}\nabla X_{t-1} - \underset{(0.0709)}{0.1585}\nabla X_{t-2} + Z_t,$$

where $\{Z_t\} \sim \mathrm{WN}(0, 0.9677)$. The 0.01, 0.05, and 0.10 cutoff values for the
corresponding test statistic when a mean term is excluded from the model are $-2.58$,
$-1.95$, and $-1.62$ (see Table 8.5.2 of Fuller 1976). In this example, the test statistic is

$$\hat\tau = \frac{-0.0012}{0.0018} = -0.667,$$

which is substantially larger than the 0.10 cutoff value of $-1.62$.
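
The two decisions in this example amount to comparing a $t$-ratio with the appropriate Dickey-Fuller cutoff. The sketch below collects the cutoff values quoted in the text in a lookup table; the dictionary layout and the function names are illustrative assumptions.

```python
# Cutoffs quoted in the text (Table 8.5.2 of Fuller 1976).
DF_CUTOFFS = {
    "with_mean": {0.01: -3.43, 0.05: -2.86, 0.10: -2.57},
    "no_mean": {0.01: -2.58, 0.05: -1.95, 0.10: -1.62},
}

def t_ratio(coef, std_err):
    """Estimated coefficient of X_{t-1} divided by its standard error."""
    return coef / std_err

def rejects_unit_root(coef, std_err, level, model="with_mean"):
    """True if the unit root hypothesis is rejected at the given level."""
    return t_ratio(coef, std_err) < DF_CUTOFFS[model][level]

tau_mu = t_ratio(-0.0041, 0.0028)    # regression with a mean term
tau = t_ratio(-0.0012, 0.0018)       # regression without a mean term
```

With the values from the example, `rejects_unit_root(-0.0041, 0.0028, 0.10)` and `rejects_unit_root(-0.0012, 0.0018, 0.10, model="no_mean")` are both false, in agreement with the conclusions drawn above.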

□ 

Further extensions of the above test to AR models with $p = O(n^{1/3})$ and to
ARMA($p$, $q$) models can be found in Said and Dickey (1984). However, as reported
in Schwert (1987) and Pantula (1991), this test must be used with caution if the
underlying model orders are not correctly specified.


6.3.2  Unit  Roots  in  Moving  Averages 

A unit root in the moving-average polynomial can have a number of interpretations
depending on the modeling application. For example, let $\{X_t\}$ be a causal and invertible
ARMA($p$, $q$) process satisfying the equations

$$\phi(B)X_t = \theta(B)Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2).$$

Then the differenced series $Y_t := \nabla X_t$ is a noninvertible ARMA($p$, $q + 1$) process
with moving-average polynomial $\theta(z)(1 - z)$. Consequently, testing for a unit root in
the moving-average polynomial is equivalent to testing that the time series has been
overdifferenced.

As a second application, it is possible to distinguish between the competing models

$$\nabla^k X_t = a + V_t$$

and

$$X_t = c_0 + c_1 t + \cdots + c_k t^k + W_t,$$

where $\{V_t\}$ and $\{W_t\}$ are invertible ARMA processes. For the former model the
differenced series $\{\nabla^k X_t\}$ has no moving-average unit roots, while for the latter model
$\{\nabla^k X_t\}$ has a multiple moving-average unit root of order $k$. We can therefore distinguish
between the two models by using the observed values of $\{\nabla^k X_t\}$ to test for the presence
of a moving-average unit root.

We confine our discussion of unit root tests to first-order moving-average models,
the general case being considerably more complicated and not fully resolved. Let
$X_1, \ldots, X_n$ be observations from the MA(1) model

$$X_t = Z_t + \theta Z_{t-1}, \quad \{Z_t\} \sim \mathrm{IID}(0, \sigma^2).$$

Davis and Dunsmuir (1996) showed that under the assumption $\theta = -1$, $n(\hat\theta + 1)$ ($\hat\theta$ is
the maximum likelihood estimator) converges in distribution. A test of $H_0: \theta = -1$
vs. $H_1: \theta > -1$ can be fashioned on this limiting result by rejecting $H_0$ when

$$\hat\theta > -1 + c_\alpha/n,$$

where $c_\alpha$ is the $(1 - \alpha)$ quantile of the limit distribution of $n(\hat\theta + 1)$. (From
Table 3.2 of Davis et al. (1995), $c_{0.01} = 11.93$, $c_{0.05} = 6.80$, and $c_{0.10} = 4.90$.)
In particular, if $n = 50$, then the null hypothesis is rejected at level 0.05 if
$\hat\theta > -1 + 6.80/50 = -0.864$.

The likelihood ratio test can also be used for testing the unit root hypothesis. The
likelihood ratio for this problem is $L(-1, S(-1)/n)/L(\hat\theta, \hat\sigma^2)$, where $L(\theta, \sigma^2)$ is the
Gaussian likelihood of the data based on an MA(1) model, $S(-1)$ is the sum of squares
given by (5.2.11) when $\theta = -1$, and $\hat\theta$ and $\hat\sigma^2$ are the maximum likelihood estimators
of $\theta$ and $\sigma^2$. The null hypothesis is rejected at level $\alpha$ if

$$\lambda_n := -2\ln\left(\frac{L(-1, S(-1)/n)}{L(\hat\theta, \hat\sigma^2)}\right) > c_{\mathrm{LR},\alpha},$$

where the cutoff value is chosen such that $P_{\theta=-1}\left[\lambda_n > c_{\mathrm{LR},\alpha}\right] = \alpha$. The limit
distribution of $\lambda_n$ was derived by Davis et al. (1995), who also gave selected quantiles
of the limit. It was found that these quantiles provide a good approximation to their
finite-sample counterparts for time series of length $n \ge 50$. The limiting quantiles for
$\lambda_n$ under $H_0$ are $c_{\mathrm{LR},0.01} = 4.41$, $c_{\mathrm{LR},0.05} = 1.94$, and $c_{\mathrm{LR},0.10} = 1.00$.


For the overshort data $\{X_t\}$ of Example 3.2.8, the maximum likelihood MA(1) model
for the mean corrected data $\{Y_t = X_t + 4.035\}$ was (see Example 5.4.1)

$$Y_t = Z_t - 0.818Z_{t-1}, \quad \{Z_t\} \sim \mathrm{WN}(0, 2040.75).$$

In the structural formulation of this model given in Example 3.2.8, the moving-average
parameter $\theta$ was related to the measurement error variances $\sigma_U^2$ and $\sigma_V^2$ through the
equation

$$\frac{\theta}{1 + \theta^2} = \frac{-\sigma_U^2}{2\sigma_U^2 + \sigma_V^2}.$$

(These error variances correspond to the daily measured amounts of fuel in the tank
and the daily measured adjustments due to sales and deliveries.) A value of $\theta = -1$
indicates that there is no appreciable measurement error due to sales and deliveries
(i.e., $\sigma_V^2 = 0$), and hence testing for a unit root in this case is equivalent to testing
that $\sigma_V^2 = 0$. Assuming that the mean is known, the unit root hypothesis is rejected
at $\alpha = 0.05$, since $-0.818 > -1 + 6.80/57 = -0.881$. The evidence against $H_0$ is
stronger using the likelihood ratio statistic. Using ITSM and entering the MA(1) model
$\theta = -1$ and $\sigma^2 = 2203.12$, we find that $-2\ln L(-1, 2203.12) = 604.584$, while
$-2\ln L(\hat\theta, \hat\sigma^2) = 597.267$. Comparing the likelihood ratio statistic $\lambda_n = 604.584 -
597.267 = 7.317$ with the cutoff value $c_{\mathrm{LR},0.01}$, we reject $H_0$ at level $\alpha = 0.01$ and
conclude that the measurement error associated with sales and deliveries is nonzero.
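
Both decision rules for the MA(1) unit root hypothesis reduce to a comparison with tabulated quantiles. The sketch below uses only the quantiles and the overshort values quoted in the text; the function names and the dictionary layout are illustrative assumptions.

```python
# Limiting quantiles quoted in the text (Davis et al. 1995).
C_MLE = {0.01: 11.93, 0.05: 6.80, 0.10: 4.90}   # for n * (theta_hat + 1)
C_LR = {0.01: 4.41, 0.05: 1.94, 0.10: 1.00}     # for the likelihood ratio statistic

def mle_test_rejects(theta_hat, n, alpha):
    """Reject H0: theta = -1 when theta_hat > -1 + c_alpha / n."""
    return theta_hat > -1.0 + C_MLE[alpha] / n

def lr_statistic(neg2_loglik_null, neg2_loglik_mle):
    """lambda_n = -2 ln L at theta = -1 minus -2 ln L at the MLE."""
    return neg2_loglik_null - neg2_loglik_mle

def lr_test_rejects(neg2_loglik_null, neg2_loglik_mle, alpha):
    """Reject H0 when lambda_n exceeds the cutoff c_LR,alpha."""
    return lr_statistic(neg2_loglik_null, neg2_loglik_mle) > C_LR[alpha]

# Overshort data: n = 57, theta_hat = -0.818.
reject_mle_05 = mle_test_rejects(-0.818, 57, 0.05)
reject_mle_01 = mle_test_rejects(-0.818, 57, 0.01)
reject_lr_01 = lr_test_rejects(604.584, 597.267, 0.01)
```

Here `reject_mle_05` and `reject_lr_01` are true while `reject_mle_01` is false, which illustrates the remark that the evidence against $H_0$ is stronger using the likelihood ratio statistic.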

In  the  above  example  it  was  assumed  that  the  mean  was  known.  In  practice,  these 
tests  should  be  adjusted  for  the  fact  that  the  mean  is  also  being  estimated. 

Tanaka (1990) proposed a locally best invariant unbiased (LBIU) test for the unit
root hypothesis. It was found that the LBIU test has slightly greater power than the
likelihood ratio test for alternatives close to $\theta = -1$ but has less power for alternatives
further away from $-1$ (see Davis et al. 1995). The LBIU test has been extended to
cover more general models by Tanaka (1990) and Tam and Reinsel (1995). Similar
extensions to tests based on the maximum likelihood estimator and the likelihood ratio
statistic have been explored in Davis et al. (1996).

□ 


6.4  Forecasting  ARIMA  Models 


In this section we demonstrate how the methods of Sections 3.3 and 5.4 can be
adapted to forecast the future values of an ARIMA($p$, $d$, $q$) process $\{X_t\}$. (The required
numerical calculations can all be carried out using the program ITSM.)

If $d \ge 1$, the first and second moments $EX_t$ and $E(X_{t+h}X_t)$ are not determined by
the difference equations (6.1.1). We cannot expect, therefore, to determine best linear
predictors for $\{X_t\}$ without further assumptions.

For example, suppose that $\{Y_t\}$ is a causal ARMA($p$, $q$) process and that $X_0$ is any
random variable. Define

$$X_t = X_0 + \sum_{j=1}^{t} Y_j, \quad t = 1, 2, \ldots.$$

Then $\{X_t,\ t \ge 0\}$ is an ARIMA($p$, 1, $q$) process with mean $EX_t = EX_0$ and autocovariances
$E(X_{t+h}X_t) - (EX_0)^2$ that depend on $\mathrm{Var}(X_0)$ and $\mathrm{Cov}(X_0, Y_j)$, $j = 1, 2, \ldots$. The
best linear predictor of $X_{n+1}$ based on $\{1, X_0, X_1, \ldots, X_n\}$ is the same as the best linear
predictor in terms of the set $\{1, X_0, Y_1, \ldots, Y_n\}$, since each linear combination of the
latter is a linear combination of the former and vice versa. Hence, using $P_n$ to denote
best linear predictor in terms of either set and using the linearity of $P_n$, we can write

$$P_nX_{n+1} = P_n(X_0 + Y_1 + \cdots + Y_{n+1}) = P_n(X_n + Y_{n+1}) = X_n + P_nY_{n+1}.$$

To evaluate $P_nY_{n+1}$ it is necessary (see Section 2.5) to know $E(X_0Y_j)$, $j = 1, \ldots, n + 1$,
and $EX_0^2$. However, if we assume that $X_0$ is uncorrelated with $\{Y_t,\ t \ge 1\}$, then
$P_nY_{n+1}$ is the same (Problem 6.5) as the best linear predictor $\hat Y_{n+1}$ of $Y_{n+1}$ in terms of
$\{1, Y_1, \ldots, Y_n\}$, which can be calculated as described in Section 3.3. The assumption
that $X_0$ is uncorrelated with $Y_1, Y_2, \ldots$ is therefore sufficient to determine the best
linear predictor $P_nX_{n+1}$ in this case.

Turning now to the general case, we shall assume that our observed process $\{X_t\}$
satisfies the difference equations

$$(1 - B)^d X_t = Y_t, \quad t = 1, 2, \ldots,$$

where $\{Y_t\}$ is a causal ARMA($p$, $q$) process, and that the random vector $(X_{1-d}, \ldots, X_0)$
is uncorrelated with $Y_t$, $t > 0$. The difference equations can be rewritten in the form

$$X_t = Y_t - \sum_{j=1}^{d}\binom{d}{j}(-1)^j X_{t-j}, \quad t = 1, 2, \ldots. \tag{6.4.1}$$

It is convenient, by relabeling the time axis if necessary, to assume that we observe
$X_{1-d}, X_{2-d}, \ldots, X_n$. (The observed values of $\{Y_t\}$ are then $Y_1, \ldots, Y_n$.) As usual, we
shall use $P_n$ to denote best linear prediction in terms of the observations up to time $n$
(in this case $1, X_{1-d}, \ldots, X_n$ or equivalently $1, X_{1-d}, \ldots, X_0, Y_1, \ldots, Y_n$).

Our goal is to compute the best linear predictors $P_nX_{n+h}$. This can be done by
applying the operator $P_n$ to each side of (6.4.1) (with $t = n + h$) and using the linearity
of $P_n$ to obtain

$$P_nX_{n+h} = P_nY_{n+h} - \sum_{j=1}^{d}\binom{d}{j}(-1)^j P_nX_{n+h-j}. \tag{6.4.2}$$

Now the assumption that $(X_{1-d}, \ldots, X_0)$ is uncorrelated with $Y_t$, $t > 0$, enables us to
identify $P_nY_{n+h}$ with the best linear predictor of $Y_{n+h}$ in terms of $\{1, Y_1, \ldots, Y_n\}$, and
this can be calculated as described in Section 3.3. The predictor $P_nX_{n+1}$ is obtained
directly from (6.4.2) by noting that $P_nX_{n+1-j} = X_{n+1-j}$ for each $j \ge 1$. The predictor
$P_nX_{n+2}$ can then be found from (6.4.2) using the previously calculated value of
$P_nX_{n+1}$. The predictors $P_nX_{n+3}, P_nX_{n+4}, \ldots$ can be computed recursively in the same
way.

To find the mean squared error of prediction it is convenient to express $P_nY_{n+h}$ in
terms of $\{X_j\}$. For $n \ge 0$ we denote the one-step predictors by $\hat Y_{n+1} = P_nY_{n+1}$ and
$\hat X_{n+1} = P_nX_{n+1}$. Then from (6.4.1) and (6.4.2) we have

$$X_{n+1} - \hat X_{n+1} = Y_{n+1} - \hat Y_{n+1}, \quad n \ge 1,$$

and hence from (3.3.12), if $n > m = \max(p, q)$ and $h \ge 1$, we can write

$$P_nY_{n+h} = \sum_{i=1}^{p}\phi_i P_nY_{n+h-i} + \sum_{j=h}^{q}\theta_{n+h-1,j}\left(X_{n+h-j} - \hat X_{n+h-j}\right). \tag{6.4.3}$$

Setting $\phi^*(z) = (1 - z)^d\phi(z) = 1 - \phi_1^* z - \cdots - \phi_{p+d}^* z^{p+d}$, we find from (6.4.2) and
(6.4.3) that

$$P_nX_{n+h} = \sum_{j=1}^{p+d}\phi_j^* P_nX_{n+h-j} + \sum_{j=h}^{q}\theta_{n+h-1,j}\left(X_{n+h-j} - \hat X_{n+h-j}\right), \tag{6.4.4}$$

which is analogous to the $h$-step prediction formula (3.3.12) for an ARMA process.
As in (3.3.13), the mean squared error of the $h$-step predictor is

$$\sigma_n^2(h) = E\left(X_{n+h} - P_nX_{n+h}\right)^2 = \sum_{j=0}^{h-1}\left(\sum_{r=0}^{j}\chi_r\theta_{n+h-r-1,j-r}\right)^2 v_{n+h-j-1}, \tag{6.4.5}$$

where $\theta_{n0} = 1$,

$$\chi(z) = \sum_{r=0}^{\infty}\chi_r z^r = \left(1 - \phi_1^* z - \cdots - \phi_{p+d}^* z^{p+d}\right)^{-1},$$

and

$$v_{n+h-j-1} = E\left(X_{n+h-j} - \hat X_{n+h-j}\right)^2 = E\left(Y_{n+h-j} - \hat Y_{n+h-j}\right)^2.$$

The coefficients $\chi_j$ can be found from the recursions (3.3.14) with $\phi_j^*$ replacing $\phi_j$. For
large $n$ we can approximate (6.4.5), provided that $\theta(\cdot)$ is invertible, by

$$\sigma_n^2(h) = \sum_{j=0}^{h-1}\psi_j^2\sigma^2, \tag{6.4.6}$$

where

$$\psi(z) = \sum_{j=0}^{\infty}\psi_j z^j = \left(\phi^*(z)\right)^{-1}\theta(z).$$
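
The approximation (6.4.6) needs only the coefficients $\psi_j$ of $\theta(z)/\phi^*(z)$, which can be generated by matching powers of $z$ in $\phi^*(z)\psi(z) = \theta(z)$. The sketch below does this in plain Python. The function names are illustrative assumptions, and the parameter values in the usage lines are those of the ARIMA(1,1,0) model fitted in Example 6.4.1.

```python
def poly_mul(a, b):
    """Multiply two polynomials given as coefficient lists [c0, c1, ...]."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def psi_weights(phi, theta, d, count):
    """First `count` coefficients of theta(z) / ((1 - z)^d phi(z))."""
    ar_poly = [1.0] + [-c for c in phi]          # phi(z) = 1 - phi_1 z - ...
    for _ in range(d):
        ar_poly = poly_mul(ar_poly, [1.0, -1.0])  # multiply by (1 - z)
    phi_star = [-c for c in ar_poly[1:]]          # phi*_1, ..., phi*_{p+d}
    ma_poly = [1.0] + list(theta)                 # theta(z) = 1 + theta_1 z + ...
    psi = []
    for j in range(count):
        value = ma_poly[j] if j < len(ma_poly) else 0.0
        for k in range(1, min(j, len(phi_star)) + 1):
            value += phi_star[k - 1] * psi[j - k]
        psi.append(value)
    return psi

def approx_mse(phi, theta, d, sigma2, h):
    """Large-sample h-step mean squared error, as in (6.4.6)."""
    return sigma2 * sum(w * w for w in psi_weights(phi, theta, d, h))

mse_1 = approx_mse([0.4471], [], 1, 0.1455, 1)
mse_2 = approx_mse([0.4471], [], 1, 0.1455, 2)
```

For these parameter values $\psi_1 = 1.4471$, so `mse_1` is 0.1455 and `mse_2` agrees with $\sigma^2(1 + 1.4471^2) \approx 0.4502$.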


6.4.1  The  Forecast  Function 

Inspection of equation (6.4.4) shows that for fixed $n > m = \max(p, q)$, the $h$-step
predictors

$$g(h) := P_nX_{n+h}$$

satisfy the homogeneous linear difference equations

$$g(h) - \phi_1^* g(h - 1) - \cdots - \phi_{p+d}^* g(h - p - d) = 0, \quad h > q, \tag{6.4.7}$$

where $\phi_1^*, \ldots, \phi_{p+d}^*$ are the coefficients of $z, \ldots, z^{p+d}$ in

$$\phi^*(z) = (1 - z)^d\phi(z).$$

The solution of (6.4.7) is well known from the theory of linear difference equations
(see Brockwell and Davis (1991), Section 3.6). If we assume that the zeros of $\phi(z)$
(denoted by $\xi_1, \ldots, \xi_p$) are all distinct, then the solution is

$$g(h) = a_0 + a_1h + \cdots + a_{d-1}h^{d-1} + b_1\xi_1^{-h} + \cdots + b_p\xi_p^{-h}, \quad h > q - p - d, \tag{6.4.8}$$

where the coefficients $a_0, \ldots, a_{d-1}$ and $b_1, \ldots, b_p$ can be determined from the $p + d$
equations obtained by equating the right-hand side of (6.4.8) for $q - p - d < h \le q$
with the corresponding value of $g(h)$ computed numerically (for $h \le 0$, $P_nX_{n+h} =
X_{n+h}$, and for $1 \le h \le q$, $P_nX_{n+h}$ can be computed from (6.4.4) as already described).
Once the constants $a_i$ and $b_i$ have been evaluated, the algebraic expression (6.4.8)
gives the predictors for all $h > q - p - d$. In the case $q = 0$, the values of $g(h)$ in
the equations for $a_0, \ldots, a_{d-1}, b_1, \ldots, b_p$ are simply the observed values $g(h) = X_{n+h}$,
$-p - d < h \le 0$, and the expression (6.4.6) for the mean squared error is exact.
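
As a small check on (6.4.7) and (6.4.8), consider an ARIMA(1,1,0) process whose differences have mean zero, so that $\phi^*(z) = (1 - z)(1 - \phi z)$, $q = 0$, and $g(h) = a + b\phi^h$. The sketch below computes the predictors both by the recursion and from the closed form. The function names and the numerical inputs are hypothetical and chosen only for illustration.

```python
def forecast_recursive(x_prev, x_last, phi, horizon):
    """g(1..horizon) from g(h) = (1 + phi) g(h-1) - phi g(h-2), h >= 1."""
    g = [x_prev, x_last]                      # g(-1) = X_{n-1}, g(0) = X_n
    for _ in range(horizon):
        g.append((1.0 + phi) * g[-1] - phi * g[-2])
    return g[2:]

def forecast_closed_form(x_prev, x_last, phi, horizon):
    """g(1..horizon) from g(h) = a + b * phi**h, i.e. (6.4.8) with d = p = 1."""
    # a and b solve g(-1) = X_{n-1} and g(0) = X_n.
    b = phi * (x_prev - x_last) / (1.0 - phi)
    a = x_last - b
    return [a + b * phi ** h for h in range(1, horizon + 1)]

# Hypothetical last two observations and AR coefficient.
rec = forecast_recursive(10.0, 12.0, 0.5, 6)
closed = forecast_closed_form(10.0, 12.0, 0.5, 6)
```

The two lists agree, and as $h$ increases the predictors approach the constant $a$ (here 14), since the term $b\phi^h$ decays geometrically when $|\phi| < 1$.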
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Example  6.4.1 


The  calculation  of  the  forecast  function  is  easily  generalized  to  deal  with  more 
complicated  ARIMA  processes.  For  example,  if  the  observations  X_i3,  X_i2, . . . ,  Xn 
are  differenced  at  lags  12  and  1,  and  (1  —  B)  (l  —  Bn)Xt  is  modeled  as  a  causal  invertible 
ARMA(p,  q)  process  with  mean  p  and  ma x(p,  q )  <  n ,  then  {Xt}  satisfies  an  equation 
of  the  form 

-  B)(l  -  Bn)Xt  -  n\=d{B)Zt,  {Zt}  ~  WN  (0,  a2) ,  (6.4.9) 

and  the  forecast  function  g(h)  =  PnXn+h  satisfies  the  analogue  of  (6.4.7),  namely, 

<p(B)(l  -  B)(  1  -  Bl2)g(h)  =  <p(l )fi,  h>  q.  (6.4.10) 


To  find  the  general  solution  of  these  inhomogeneous  linear  difference  equations,  it 
suffices  (see  Brockwell  and  Davis  (1991),  Section  3.6)  to  find  one  particular  solution 
of  (6.4.10)  and  then  add  to  it  the  general  solution  of  the  same  equations  with  the  right- 
hand  side  set  equal  to  zero.  A  particular  solution  is  easily  found  (by  trial  and  error) 
to  be 


ph2 
~2A  ’ 


and  the  general  solution  is  therefore 


g(h)  = 


/ih2 

~7A 


li 


4-  h  £~h 

'  UPC>P  ’ 


+  clq  +  ci\h  +  ^  '  CjelJTT/6  +  b\^  ^  H-  *  * 

7=1 

h  >  q  —  p  —  13. 


(6.4.11) 


(The terms $a_0$ and $a_1h$ correspond to the double root $z = 1$ of the equation
$\phi(z)(1 - z)(1 - z^{12}) = 0$, and the subsequent terms to each of the other roots, which we
assume to be distinct.) For $q - p - 13 < h \le 0$, $g(h) = X_{n+h}$, and for $1 \le h \le q$, the
values of $g(h) = P_nX_{n+h}$ can be determined recursively from the equations

$$P_nX_{n+h} = \mu + P_nX_{n+h-1} + P_nX_{n+h-12} - P_nX_{n+h-13} + P_nY_{n+h},$$

where $\{Y_t\}$ is the ARMA process $Y_t = (1 - B)(1 - B^{12})X_t - \mu$. Substituting these
values of $g(h)$ into (6.4.11), we obtain a set of $p + 13$ equations for the coefficients
$a_i$, $b_j$, and $c_k$. Solving these equations then completes the determination of $g(h)$.

The large-sample approximation to the mean squared error is again given by
(6.4.6), with $\psi_j$ redefined as the coefficient of $z^j$ in the power series expansion of
$\theta(z)/[(1 - z)(1 - z^{12})\phi(z)]$.


An ARIMA(1,1,0) Model

In Example 5.2.4 we found the maximum likelihood AR(1) model for the mean-corrected
differences $X_t$ of the Dow Jones Utilities Index (August 28-December 18, 1972).
The model was

$$X_t - 0.4471X_{t-1} = Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, 0.1455), \tag{6.4.12}$$

where $X_t = D_t - D_{t-1} - 0.1336$, $t = 1, \ldots, 77$, and $\{D_t, t = 0, 1, 2, \ldots, 77\}$ is the
original series. The model for $\{D_t\}$ is thus

$$(1 - 0.4471B)\left[(1 - B)D_t - 0.1336\right] = Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, 0.1455).$$

The  recursions  for  g(h)  therefore  take  the  form 

$$(1 - 0.4471B)(1 - B)g(h) = 0.5529 \times 0.1336 = 0.07387, \quad h > 0. \tag{6.4.13}$$

A particular solution of these equations is $g(h) = 0.1336h$, so the general solution is

$$g(h) = 0.1336h + a + b(0.4471)^h, \quad h > -2. \tag{6.4.14}$$

Substituting $g(-1) = D_{76} = 122$ and $g(0) = D_{77} = 121.23$ in the equations with
$h = -1$ and $h = 0$, and solving for $a$ and $b$ gives

$$g(h) = 0.1336h + 120.50 + 0.7331(0.4471)^h.$$

Setting $h = 1$ and $h = 2$ gives

$$P_{77}D_{78} = 120.97 \quad \text{and} \quad P_{77}D_{79} = 120.94.$$

From  (6.4.5)  we  find  that  the  corresponding  mean  squared  errors  are 

$$\sigma_{77}^2(1) = v_{77} = \sigma^2 = 0.1455$$

and

$$\sigma_{77}^2(2) = v_{78} + \phi_1^{*2}v_{77} = \sigma^2\left(1 + 1.4471^2\right) = 0.4502.$$

(Notice that the approximation (6.4.6) is exact in this case.) The predictors and their
mean squared errors are easily obtained from the program ITSM by opening the file
DOWJ.TSM, differencing at lag 1, fitting a preliminary AR(1) model to the mean-corrected
data with Burg's algorithm, and selecting Model>Estimation>Max likelihood
to find the maximum likelihood AR(1) model. Predicted values and
their mean squared errors are then found using the option Forecasting>ARMA.

□ 
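The calculation in this example can be checked numerically. The following is a minimal sketch, assuming only the rounded values quoted in the text ($\phi = 0.4471$, mean 0.1336, $\sigma^2 = 0.1455$, $D_{76} = 122$, $D_{77} = 121.23$); the function names are ours and are not part of ITSM. It solves for $a$ and $b$ in (6.4.14), confirms that the closed form agrees with direct iteration of (6.4.13), and computes the mean squared errors from the coefficients $\psi_j$ of $1/[(1 - 0.4471z)(1 - z)]$. Because the inputs are rounded, the last digit of the results may differ from the values printed above.

```python
# Sketch: forecast function and prediction MSE for the ARIMA(1,1,0) example.
PHI = 0.4471              # AR(1) coefficient of the mean-corrected differences
MU = 0.1336               # mean of the differences D_t - D_{t-1}
SIGMA2 = 0.1455           # white noise variance
D76, D77 = 122.0, 121.23  # last two observations, as quoted in the text


def solve_forecast_function():
    """Solve g(-1) = D76, g(0) = D77 for a, b in g(h) = MU*h + a + b*PHI**h."""
    # g(0)  = a + b            = D77
    # g(-1) = -MU + a + b/PHI  = D76
    b = (D76 + MU - D77) / (1.0 / PHI - 1.0)
    a = D77 - b
    return a, b


def g(h, a, b):
    """Closed-form forecast function (6.4.14)."""
    return MU * h + a + b * PHI ** h


def recursive_forecasts(steps):
    """Iterate (1 - PHI*B)(1 - B) g(h) = (1 - PHI)*MU from g(-1), g(0)."""
    prev2, prev1 = D76, D77
    out = []
    for _ in range(steps):
        nxt = (1 + PHI) * prev1 - PHI * prev2 + (1 - PHI) * MU
        out.append(nxt)
        prev2, prev1 = prev1, nxt
    return out


def prediction_mse(steps):
    """sigma^2 * sum_{j<h} psi_j^2 with psi(z) = 1 / ((1 - PHI z)(1 - z))."""
    psi = [1.0]
    for j in range(1, steps):
        older = PHI * psi[j - 2] if j >= 2 else 0.0
        psi.append((1 + PHI) * psi[j - 1] - older)
    return [SIGMA2 * sum(p * p for p in psi[:h]) for h in range(1, steps + 1)]


a, b = solve_forecast_function()
closed_form = [g(h, a, b) for h in (1, 2)]
recursive = recursive_forecasts(2)
mse = prediction_mse(2)
print(a, b, closed_form, mse)
```

The agreement between `closed_form` and `recursive` illustrates why only two boundary values are needed here: the difference equation is of order two.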


6.5  Seasonal  ARIMA  Models 

We have already seen how differencing the series $\{X_t\}$ at lag $s$ is a convenient way
of eliminating a seasonal component of period $s$. If we fit an ARMA($p, q$) model
$\phi(B)Y_t = \theta(B)Z_t$ to the differenced series $Y_t = (1 - B^s)X_t$, then the model for the
original series is $\phi(B)(1 - B^s)X_t = \theta(B)Z_t$. This is a special case of the general
seasonal ARIMA (SARIMA) model defined as follows.


Definition  6.5.1 


If $d$ and $D$ are nonnegative integers, then $\{X_t\}$ is a seasonal
ARIMA$(p, d, q) \times (P, D, Q)_s$ process with period $s$ if the differenced series
$Y_t = (1 - B)^d(1 - B^s)^DX_t$ is a causal ARMA process defined by

$$\phi(B)\Phi(B^s)Y_t = \theta(B)\Theta(B^s)Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2), \tag{6.5.1}$$

where $\phi(z) = 1 - \phi_1z - \cdots - \phi_pz^p$, $\Phi(z) = 1 - \Phi_1z - \cdots - \Phi_Pz^P$,
$\theta(z) = 1 + \theta_1z + \cdots + \theta_qz^q$, and $\Theta(z) = 1 + \Theta_1z + \cdots + \Theta_Qz^Q$.


Remark 1. Note that the process $\{Y_t\}$ is causal if and only if $\phi(z) \ne 0$ and $\Phi(z) \ne 0$
for $|z| \le 1$. In applications $D$ is rarely more than one, and $P$ and $Q$ are typically less
than three. □
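The product structure of the operators in (6.5.1) can be made concrete by multiplying the polynomials out. The sketch below is illustrative only: the helper names and the numerical coefficients are hypothetical. It represents a polynomial by its list of coefficients of $z^0, z^1, \ldots$ (signs included) and expands $\theta(z)\Theta(z^{12})$ for a case with $q = Q = 1$. The same helpers apply to $\phi(z)\Phi(z^s)$ if the coefficients are entered with their minus signs.

```python
def poly_mul(a, b):
    """Multiply two polynomials given as coefficient lists (index = power of z)."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def seasonal_poly(coeffs, s):
    """Coefficient list of 1 + c_1 z^s + c_2 z^(2s) + ... for c = coeffs."""
    out = [0.0] * (s * len(coeffs) + 1)
    out[0] = 1.0
    for k, c in enumerate(coeffs, start=1):
        out[k * s] = c
    return out


# Hypothetical values chosen for illustration only.
s = 12
theta = [1.0, -0.3]                  # theta(z) = 1 - 0.3 z
Theta = seasonal_poly([-0.3], s)     # Theta(z^12) = 1 - 0.3 z^12
theta_star = poly_mul(theta, Theta)  # coefficients of theta(z) * Theta(z^12)

# Only lags 0, 1, 12 and 13 carry nonzero coefficients, and the lag-13
# coefficient is the product of the lag-1 and lag-12 coefficients.
nonzero_lags = [k for k, c in enumerate(theta_star) if abs(c) > 1e-12]
print(nonzero_lags, theta_star[13])
```

The expanded polynomial has degree $q + sQ = 13$ but only three free or constrained nonzero coefficients, which is the sense in which a SARIMA model is a constrained high-order ARMA model.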

Remark 2. Equation (6.5.1) satisfied by the differenced process $\{Y_t\}$ can be rewritten
in the equivalent form

$$\phi^*(B)Y_t = \theta^*(B)Z_t, \tag{6.5.2}$$

where $\phi^*(\cdot)$, $\theta^*(\cdot)$ are polynomials of degree $p + sP$ and $q + sQ$, respectively, whose
coefficients can all be expressed in terms of $\phi_1, \ldots, \phi_p$, $\Phi_1, \ldots, \Phi_P$,
$\theta_1, \ldots, \theta_q$, and $\Theta_1, \ldots, \Theta_Q$. Provided that $p < s$ and $q < s$, the constraints
on the coefficients of $\phi^*(\cdot)$ and $\theta^*(\cdot)$ can all be expressed as multiplicative relations

$$\phi^*_{is+j} = \phi^*_{is}\phi^*_j, \quad i = 1, 2, \ldots; \; j = 1, \ldots, s - 1,$$

and

$$\theta^*_{is+j} = \theta^*_{is}\theta^*_j, \quad i = 1, 2, \ldots; \; j = 1, \ldots, s - 1.$$

In Section 1.5 we discussed the classical decomposition model incorporating trend,
seasonality, and random noise, namely, $X_t = m_t + s_t + Y_t$. In modeling real data
it might not be reasonable to assume, as in the classical decomposition model, that
the seasonal component $s_t$ repeats itself precisely in the same way cycle after cycle.
Seasonal ARIMA models allow for randomness in the seasonal pattern from one cycle
to the next. □


Example 6.5.1

Suppose we have $r$ years of monthly data, which we tabulate as follows:


Year/Month    1                 2                 ...    12

1             Y_1               Y_2               ...    Y_12
2             Y_13              Y_14              ...    Y_24
3             Y_25              Y_26              ...    Y_36
...           ...               ...               ...    ...
r             Y_{1+12(r-1)}     Y_{2+12(r-1)}     ...    Y_{12+12(r-1)}

Each column in this table may itself be viewed as a realization of a time series. Suppose
that each one of these twelve time series is generated by the same ARMA($P, Q$)
model, or more specifically, that the series corresponding to the $j$th month, $Y_{j+12t}$,
$t = 0, \ldots, r - 1$, satisfies a difference equation of the form

$$Y_{j+12t} = \Phi_1Y_{j+12(t-1)} + \cdots + \Phi_PY_{j+12(t-P)} + U_{j+12t} + \Theta_1U_{j+12(t-1)} + \cdots + \Theta_QU_{j+12(t-Q)}, \tag{6.5.3}$$

where

$$\{U_{j+12t},\ t = \ldots, -1, 0, 1, \ldots\} \sim \mathrm{WN}(0, \sigma_U^2). \tag{6.5.4}$$

Then since the same ARMA($P, Q$) model is assumed to apply to each month, (6.5.3)
holds for each $j = 1, \ldots, 12$. (Notice, however, that $E(U_tU_{t+h})$ is not necessarily
zero except when $h$ is an integer multiple of 12.) We can thus write (6.5.3) in the
compact form

$$\Phi(B^{12})Y_t = \Theta(B^{12})U_t, \tag{6.5.5}$$

where $\Phi(z) = 1 - \Phi_1z - \cdots - \Phi_Pz^P$, $\Theta(z) = 1 + \Theta_1z + \cdots + \Theta_Qz^Q$, and
$\{U_{j+12t},\ t = \ldots, -1, 0, 1, \ldots\} \sim \mathrm{WN}(0, \sigma_U^2)$ for each $j$. We refer to the
model (6.5.5) as the between-year model.

□ 


Example 6.5.2

Suppose $P = 0$, $Q = 1$, and $\Theta_1 = -0.4$ in (6.5.5). Then the series for any particular
month is a moving-average of order 1. If $E(U_tU_{t+h}) = 0$ for all $h \ne 0$, i.e., if the white
noise sequences for different months are uncorrelated with each other, then the columns
themselves are uncorrelated. The correlation function for such a process is shown in
Figure 6-15.

□

Figure 6-15. The ACF of the model $X_t = U_t - 0.4U_{t-12}$ of Example 6.5.2.

Figure 6-16. The ACF of the model $X_t - 0.7X_{t-12} = U_t$ of Example 6.5.3.


Example 6.5.3

Suppose $P = 1$, $Q = 0$, and $\Phi_1 = 0.7$ in (6.5.5). In this case the 12 series (one for
each month) are AR(1) processes that are uncorrelated if the white noise sequences
for different months are uncorrelated. A graph of the autocorrelation function of this
process is shown in Figure 6-16.

□ 
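The autocorrelation functions of Examples 6.5.2 and 6.5.3 can be computed directly. The following sketch assumes, as in those examples, that $\{U_t\}$ is uncorrelated across all $t$; the function names are ours. For the seasonal MA(1) model the only nonzero autocorrelation is at lag 12, and for the seasonal AR(1) model the autocorrelations are $0.7^k$ at lag $12k$ and zero elsewhere.

```python
def ma_acf(theta_star, max_lag):
    """ACF of X_t = sum_k theta_star[k] * U_{t-k} with uncorrelated U_t."""
    q = len(theta_star) - 1
    gamma = []
    for h in range(max_lag + 1):
        if h > q:
            gamma.append(0.0)
        else:
            gamma.append(sum(theta_star[k] * theta_star[k + h]
                             for k in range(q - h + 1)))
    return [g / gamma[0] for g in gamma]


def seasonal_ar1_acf(Phi, s, max_lag):
    """ACF of X_t - Phi * X_{t-s} = U_t: Phi**k at lag k*s, zero elsewhere."""
    return [Phi ** (h // s) if h % s == 0 else 0.0 for h in range(max_lag + 1)]


# Example 6.5.2: X_t = U_t - 0.4 U_{t-12}
rho_ma = ma_acf([1.0] + [0.0] * 11 + [-0.4], max_lag=24)

# Example 6.5.3: X_t - 0.7 X_{t-12} = U_t
rho_ar = seasonal_ar1_acf(0.7, 12, max_lag=24)

print(rho_ma[12], rho_ar[12], rho_ar[24])
```

The lag-12 value for the moving-average model is $-0.4/(1 + 0.4^2)$, the familiar MA(1) formula applied to the yearly series for a single month.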

In each of the Examples 6.5.1-6.5.3, the 12 series corresponding to the different
months are uncorrelated. To incorporate dependence between these series
we allow the process $\{U_t\}$ in (6.5.5) to follow an ARMA($p, q$) model,

$$\phi(B)U_t = \theta(B)Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2). \tag{6.5.6}$$

This assumption implies possible nonzero correlation not only between consecutive
values of $U_t$, but also within the 12 sequences $\{U_{j+12t},\ t = \ldots, -1, 0, 1, \ldots\}$, each of
which was assumed to be uncorrelated in the preceding examples. In this case (6.5.4)
may no longer hold; however, the coefficients in (6.5.6) will frequently have values
such that $E(U_tU_{t+12j})$ is small for $j = \pm 1, \pm 2, \ldots$. Combining the two models (6.5.5)
and (6.5.6) and allowing for possible differencing leads directly to Definition 6.5.1 of
the general SARIMA model as given above.

The first steps in identifying SARIMA models for a (possibly transformed) data
set are to find $d$ and $D$ so as to make the differenced observations

$$Y_t = (1 - B)^d(1 - B^s)^DX_t$$

stationary in appearance (see Sections 6.1-6.3). Next we examine the sample ACF
and PACF of $\{Y_t\}$ at lags that are multiples of $s$ for an indication of the orders $P$ and
$Q$ in the model (6.5.5). If $\hat\rho(\cdot)$ is the sample ACF of $\{Y_t\}$, then $P$ and $Q$ should be
chosen such that $\hat\rho(ks)$, $k = 1, 2, \ldots$, is compatible with the ACF of an ARMA($P, Q$)
process. The orders $p$ and $q$ are then selected by trying to match $\hat\rho(1), \ldots, \hat\rho(s - 1)$
with the ACF of an ARMA($p, q$) process. Ultimately, the AICC criterion (Section 5.5)
and the goodness of fit tests (Section 5.3) are used to select the best SARIMA model
from competing alternatives.
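The first two identification steps, differencing and inspection of the sample ACF at seasonal and nonseasonal lags, can be sketched as follows. The data here are simulated and purely hypothetical (a linear trend, a fixed monthly pattern, and Gaussian noise), and the helper names are ours; ITSM performs the same operations through its menus.

```python
import random


def difference(x, lag=1):
    """Apply (1 - B^lag): returns x[t] - x[t-lag] for t >= lag."""
    return [x[t] - x[t - lag] for t in range(lag, len(x))]


def sample_acf(y, max_lag):
    """Sample ACF, using the divisor n in every sample autocovariance."""
    n = len(y)
    mean = sum(y) / n
    dev = [v - mean for v in y]
    gamma0 = sum(d * d for d in dev) / n
    acf = []
    for h in range(max_lag + 1):
        gamma_h = sum(dev[t] * dev[t + h] for t in range(n - h)) / n
        acf.append(gamma_h / gamma0)
    return acf


# Hypothetical monthly series: trend + fixed seasonal pattern + noise.
random.seed(1)
pattern = [5, 3, 0, -2, -4, -5, -4, -2, 0, 2, 4, 3]
x = [0.5 * t + pattern[t % 12] + random.gauss(0.0, 1.0) for t in range(120)]

# Y_t = (1 - B)(1 - B^12) X_t removes both the trend and the seasonal pattern.
y = difference(difference(x, lag=12), lag=1)
rho = sample_acf(y, max_lag=36)

seasonal_lags = {k * 12: rho[k * 12] for k in (1, 2, 3)}   # guides P and Q
short_lags = rho[1:12]                                     # guides p and q
print(len(y), seasonal_lags)
```

Differencing at lags 12 and 1 shortens the series by 13 observations, so 120 values of $X_t$ leave 107 values of $Y_t$.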

For given values of $p$, $d$, $q$, $P$, $D$, and $Q$, the parameters $\boldsymbol{\phi}$, $\boldsymbol{\theta}$,
$\boldsymbol{\Phi}$, $\boldsymbol{\Theta}$, and $\sigma^2$ can be found using the maximum likelihood procedure of
Section 5.2. The differences $Y_t = (1 - B)^d(1 - B^s)^DX_t$ constitute an
ARMA($p + sP, q + sQ$) process in which some of the coefficients are zero and the rest are
functions of the $(p + P + q + Q)$-dimensional vector
$\boldsymbol{\beta}' = (\boldsymbol{\phi}', \boldsymbol{\Phi}', \boldsymbol{\theta}', \boldsymbol{\Theta}')$. For any fixed $\boldsymbol{\beta}$ the reduced
likelihood $\ell(\boldsymbol{\beta})$ of the differences $Y_{1+d+sD}, \ldots, Y_n$ is easily computed as described
in Section 5.2. The maximum likelihood estimator of $\boldsymbol{\beta}$ is the value that minimizes
$\ell(\boldsymbol{\beta})$, and the maximum likelihood estimate of $\sigma^2$ is given by (5.2.10). The estimates
can be found using the program ITSM by specifying the required multiplicative relationships
among the coefficients as given in Remark 2 above.

A more direct approach to modeling the differenced series $\{Y_t\}$ is simply to fit a
subset ARMA model of the form (6.5.2) without making use of the multiplicative form
of $\phi^*(\cdot)$ and $\theta^*(\cdot)$ in (6.5.1).

Example  6.5.4  Monthly  Accidental  Deaths 

In Figure 1-27 we showed the series $\{Y_t = (1 - B^{12})(1 - B)X_t\}$ obtained by differencing
the accidental deaths series $\{X_t\}$ once at lag 12 and once at lag 1. The sample ACF of
$\{Y_t\}$ is shown in Figure 6-17.

□ 

The values $\hat\rho(12) = -0.333$, $\hat\rho(24) = -0.099$, and $\hat\rho(36) = 0.013$ suggest a
moving-average of order 1 for the between-year model (i.e., $P = 0$ and $Q = 1$).
Moreover, inspection of $\hat\rho(1), \ldots, \hat\rho(11)$ suggests that $\hat\rho(1)$ is the only short-term
correlation different from zero, so we also choose a moving-average of order 1 for
the between-month model (i.e., $p = 0$ and $q = 1$). Taking into account the sample
mean (28.831) of the differences $\{Y_t\}$, we therefore arrive at the model

$$Y_t = 28.831 + (1 + \theta_1B)(1 + \Theta_1B^{12})Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2), \tag{6.5.7}$$

for the series $\{Y_t\}$. The maximum likelihood estimates of the parameters are obtained
from ITSM by opening the file DEATHS.TSM and proceeding as follows. After
differencing (at lags 1 and 12) and then mean-correcting the data, choose the option
Model>Specify. In the dialog box enter an MA(13) model with $\theta_1 = -0.3$,
$\theta_{12} = -0.3$, $\theta_{13} = 0.09$, and all other coefficients zero. (This corresponds to the initial
guess $Y_t = (1 - 0.3B)(1 - 0.3B^{12})Z_t$.) Then choose Model>Estimation>Max
likelihood and click on the button Constrain optimization. Specify the
number of multiplicative relations (one in this case) in the box provided and define the
relationship by entering 1, 12, 13 to indicate that $\theta_1 \times \theta_{12} = \theta_{13}$. Click OK to return
to the Maximum Likelihood dialog box. Click OK again to obtain the parameter
estimates

$$\hat\theta_1 = -0.478,$$

$$\hat\Theta_1 = -0.591,$$

and

$$\hat\sigma^2 = 94255,$$

with AICC value 855.53. The corresponding fitted model for $\{X_t\}$ is thus the
SARIMA$(0, 1, 1) \times (0, 1, 1)_{12}$ process

$$\nabla\nabla_{12}X_t = 28.831 + (1 - 0.478B)(1 - 0.591B^{12})Z_t, \tag{6.5.8}$$

where $\{Z_t\} \sim \mathrm{WN}(0, 94390)$.

Figure 6-17. The sample ACF of the differenced accidental deaths $\{\nabla\nabla_{12}X_t\}$.

If we adopt the alternative approach of fitting a subset ARMA model to $\{Y_t\}$
without seeking a multiplicative structure for the operators $\phi^*(B)$ and $\theta^*(B)$ in (6.5.2),
we begin by fitting a preliminary MA(13) model (as suggested by Figure 6-17) to
the series $\{Y_t\}$. We then fit a maximum likelihood MA(13) model and examine the
standard errors of the coefficient estimators. This suggests setting the coefficients at
lags 2, 3, 8, 10, and 11 equal to zero, since these are all less than one standard error from
zero. To do this select Model>Estimation>Max likelihood and click on the
button Constrain optimization. Then highlight the coefficients to be set to
zero and click on the button Set to zero. Click OK to return to the Maximum
Likelihood Estimation dialog box and again to carry out the constrained
optimization. The coefficients that have been set to zero will be held at that value, and
the optimization will be with respect to the remaining coefficients. This gives a model
with substantially smaller AICC than the unconstrained MA(13) model. Examining
the standard errors again we see that the coefficients at lags 4, 5, and 7 are promising
candidates to be set to zero, since each of them is less than one standard error from
zero. Setting these coefficients to zero in the same way and reoptimizing gives a further
reduction in AICC. Setting the coefficient at lag 9 to zero and reoptimizing again gives
a further reduction in AICC (to 855.61) and the fitted model

$$\nabla\nabla_{12}X_t = 28.831 + Z_t - 0.596Z_{t-1} - 0.407Z_{t-6} - 0.685Z_{t-12} + 0.460Z_{t-13}, \quad \{Z_t\} \sim \mathrm{WN}(0, 71240). \tag{6.5.9}$$

The AICC value 855.61 is quite close to the value 855.53 for the model (6.5.8). The
residuals from the two models are also very similar, the randomness tests (with the
exception of the difference-sign test) yielding high $p$-values for both.


6.5.1  Forecasting  SARIMA  Processes 

Forecasting SARIMA processes is completely analogous to the forecasting of ARIMA
processes discussed in Section 6.4. Expanding out the operator $(1 - B)^d(1 - B^s)^D$ in
powers of $B$, rearranging the equation

$$(1 - B)^d(1 - B^s)^DX_t = Y_t,$$

and setting $t = n + h$ gives the analogue

$$X_{n+h} = Y_{n+h} + \sum_{j=1}^{d+Ds}a_jX_{n+h-j} \tag{6.5.10}$$

of equation (6.4.2). Under the assumption that the first $d + Ds$ observations
$X_{-d-Ds+1}, \ldots, X_0$ are uncorrelated with $\{Y_t, t \ge 1\}$, we can determine the best linear
predictors $P_nX_{n+h}$ of $X_{n+h}$ based on $\{1, X_{-d-Ds+1}, \ldots, X_n\}$ by applying $P_n$ to each side
of (6.5.10) to obtain

$$P_nX_{n+h} = P_nY_{n+h} + \sum_{j=1}^{d+Ds}a_jP_nX_{n+h-j}. \tag{6.5.11}$$

The first term on the right is just the best linear predictor of the (possibly nonzero-mean)
ARMA process $\{Y_t\}$ in terms of $\{1, Y_1, \ldots, Y_n\}$, which can be calculated as
described in Section 3.3. The predictors $P_nX_{n+h}$ can then be computed recursively for
$h = 1, 2, \ldots$ from (6.5.11), if we note that $P_nX_{n+1-j} = X_{n+1-j}$ for each $j \ge 1$.
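The recursion (6.5.11) is easy to sketch in code once the coefficients $a_j$ have been read off from the expansion of $(1 - B)^d(1 - B^s)^D$. In the sketch below the function names are ours and the data are hypothetical; the predictors $P_nY_{n+h}$ of the differenced series are simply supplied as inputs (they would come from the ARMA predictor of Section 3.3).

```python
def poly_mul(a, b):
    """Multiply two polynomials in B given as coefficient lists."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def differencing_coeffs(d, D, s):
    """Return [a_1, ..., a_{d+Ds}] such that X_t = Y_t + sum_j a_j X_{t-j}."""
    poly = [1.0]
    for _ in range(d):
        poly = poly_mul(poly, [1.0, -1.0])                # factor (1 - B)
    seasonal = [1.0] + [0.0] * (s - 1) + [-1.0]           # factor (1 - B^s)
    for _ in range(D):
        poly = poly_mul(poly, seasonal)
    # (1 - B)^d (1 - B^s)^D = 1 - a_1 B - a_2 B^2 - ..., hence a_j = -poly[j].
    return [-c for c in poly[1:]]


def forecast_original(x, y_forecasts, a):
    """Apply (6.5.11) recursively; x must hold at least len(a) observations."""
    ext = list(x)      # observations, with predictors appended as we go
    for y_hat in y_forecasts:
        nxt = y_hat + sum(a_j * ext[-j] for j, a_j in enumerate(a, start=1))
        ext.append(nxt)
    return ext[len(x):]


a = differencing_coeffs(d=1, D=1, s=12)   # a_1 = 1, a_12 = 1, a_13 = -1

# Hypothetical series with an exact linear trend and seasonal pattern, so that
# the differenced series is identically zero and its predictors are zero too.
pattern = [5, 3, 0, -2, -4, -5, -4, -2, 0, 2, 4, 3]
x = [0.5 * t + pattern[t % 12] for t in range(36)]
forecasts = forecast_original(x, [0.0] * 6, a)
print(forecasts)
```

For $d = D = 1$ and $s = 12$ the recursion reduces to $P_nX_{n+h} = P_nY_{n+h} + P_nX_{n+h-1} + P_nX_{n+h-12} - P_nX_{n+h-13}$, which matches the form used for lag-12 and lag-1 differencing in Section 6.4.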

An argument analogous to the one leading to (6.4.5) gives the prediction mean
squared error as

$$\sigma_n^2(h) = E(X_{n+h} - P_nX_{n+h})^2 = \sum_{j=0}^{h-1}\left(\sum_{r=0}^{j}\chi_r\theta_{n+h-r-1,\,j-r}\right)^2v_{n+h-j-1}, \tag{6.5.12}$$

where $\theta_{nj}$ and $v_n$ are obtained by applying the innovations algorithm to the differenced
series $\{Y_t\}$ and

$$\chi(z) = \sum_{r=0}^{\infty}\chi_rz^r = \left[\phi(z)\Phi(z^s)(1 - z)^d(1 - z^s)^D\right]^{-1}, \quad |z| < 1.$$

Table 6.1  Predicted values of the Accidental Deaths series for $t = 73, \ldots, 78$, the
standard deviations $\sigma_t$ of the prediction errors, and the corresponding observed
values of $X_t$ for the same period

t                    73      74      75      76      77      78

Model (6.5.8)
  Predictors       8441    7704    8549    8885    9843   10279
  $\sigma_t$        308     348     383     415     445     474

Model (6.5.9)
  Predictors       8345    7619    8356    8742    9795   10179
  $\sigma_t$        292     329     366     403     442     486

Observed values
  $X_t$            7798    7406    8363    8460    9217    9316

For large $n$ we can approximate (6.5.12), if $\theta(z)\Theta(z^s)$ is nonzero for all $|z| \le 1$, by

$$\sigma_n^2(h) = \sum_{j=0}^{h-1}\psi_j^2\sigma^2, \tag{6.5.13}$$

where

$$\psi(z) = \sum_{j=0}^{\infty}\psi_jz^j = \frac{\theta(z)\Theta(z^s)}{\phi(z)\Phi(z^s)(1 - z)^d(1 - z^s)^D}, \quad |z| < 1.$$

The required calculations can all be carried out with the aid of the program ITSM.
The mean squared errors are computed from the large-sample approximation (6.5.13)
if the fitted model is invertible. If the fitted model is not invertible, ITSM computes the
mean squared errors by converting the model to the equivalent (in terms of Gaussian
likelihood) invertible model and then using (6.5.13).
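The large-sample approximation (6.5.13) needs only the first few coefficients of the power series $\psi(z)$. The sketch below computes them by polynomial division for the fitted model (6.5.8), using the parameter values quoted in Example 6.5.4 and the white noise variance 94390 stated with (6.5.8); the helper names are ours. It is an independent check rather than a reproduction of ITSM's internal computation, so small differences from Table 6.1 are to be expected.

```python
def poly_mul(a, b):
    """Multiply two polynomials given as coefficient lists (index = power of z)."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def series_div(num, den, n_terms):
    """First n_terms coefficients of num(z) / den(z), assuming den[0] == 1."""
    psi = []
    for j in range(n_terms):
        acc = num[j] if j < len(num) else 0.0
        for k in range(1, min(j, len(den) - 1) + 1):
            acc -= den[k] * psi[j - k]
        psi.append(acc)
    return psi


s = 12
theta1, Theta1, sigma2 = -0.478, -0.591, 94390.0   # quoted for model (6.5.8)

# Numerator theta(z) * Theta(z^12); denominator (1 - z)(1 - z^12).
ma = poly_mul([1.0, theta1], [1.0] + [0.0] * (s - 1) + [Theta1])
ar = poly_mul([1.0, -1.0], [1.0] + [0.0] * (s - 1) + [-1.0])

psi = series_div(ma, ar, 6)
sd = [(sigma2 * sum(p * p for p in psi[:h])) ** 0.5 for h in range(1, 7)]
print([round(v, 1) for v in sd])
```

For $h \le 12$ every $\psi_j$ with $j \ge 1$ equals $1 + \theta_1 = 0.522$, so the mean squared error grows linearly in $h$ over the first year of forecasts. The resulting standard deviations agree with the $\sigma_t$ row for model (6.5.8) in Table 6.1 to within about one percent.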


Example  6.5.5  Monthly  Accidental  Deaths 

Continuing  with  Example  6.5.4,  we  next  use  ITSM  to  predict  six  future  values  of 
the  Accidental  Deaths  series  using  the  fitted  models  (6.5.8)  and  (6.5.9).  First  fit  the 
desired  model  as  described  in  Example  6.5.4  or  enter  the  data  and  model  directly 
by opening the file DEATHS.TSM, differencing at lags 12 and 1, subtracting the
mean, and then entering the MA(13) coefficients and white noise variance using the
option Model>Specify. Select Forecasting>ARMA, and you will see the ARMA
Forecast  dialog  box.  Enter  6  for  the  number  of  predicted  values  required.  You  will 
notice  that  the  default  options  in  the  dialog  box  are  set  to  generate  predictors  of  the 
original  series  by  reversing  the  transformations  applied  to  the  data.  If  for  some  reason 
you  wish  to  predict  the  transformed  data,  these  check  marks  can  be  removed.  If  you 
wish  to  include  prediction  bounds  in  the  graph  of  the  predictors,  check  the  appropriate 
box  and  specify  the  desired  coefficient  (e.g.,  95  %).  Click  OK,  and  you  will  see  a 
graph  of  the  data  with  the  six  predicted  values  appended.  For  numerical  values  of 
the  predictors  and  prediction  bounds,  right-click  on  the  graph  and  then  on  Info.  The 
prediction  bounds  are  computed  under  the  assumption  that  the  white  noise  sequence  in 
the  ARMA  model  for  the  transformed  data  is  Gaussian.  Table  6.1  shows  the  predictors 
and  standard  deviations  of  the  prediction  errors  under  both  models  (6.5.8)  and  (6.5.9) 
for  the  Accidental  Deaths  series. 


□


6.6  Regression with ARMA Errors

6.6.1  OLS  and  GLS  Estimation 

In  standard  linear  regression,  the  errors  (or  deviations  of  the  observations  from  the 
regression  function)  are  assumed  to  be  independent  and  identically  distributed.  In 
many  applications  of  regression  analysis,  however,  this  assumption  is  clearly 
violated,  as  can  be  seen  by  examination  of  the  residuals  from  the  fitted  regression 
and  their  sample  autocorrelations.  It  is  often  more  appropriate  to  assume  that  the 
errors  are  observations  of  a  zero-mean  second-order  stationary  process.  Since  many 
autocorrelation  functions  can  be  well  approximated  by  the  autocorrelation  function  of 
a suitably chosen ARMA($p, q$) process, it is of particular interest to consider the model

$$Y_t = \mathbf{x}_t'\boldsymbol{\beta} + W_t, \quad t = 1, \ldots, n, \tag{6.6.1}$$

or in matrix notation,

$$\mathbf{Y} = X\boldsymbol{\beta} + \mathbf{W}, \tag{6.6.2}$$

where $\mathbf{Y} = (Y_1, \ldots, Y_n)'$ is the vector of observations at times $t = 1, \ldots, n$, $X$
is the design matrix whose $t$th row, $\mathbf{x}_t' = (x_{t1}, \ldots, x_{tk})$, consists of the values of
the explanatory variables at time $t$, $\boldsymbol{\beta} = (\beta_1, \ldots, \beta_k)'$ is the vector of regression
coefficients, and the components of $\mathbf{W} = (W_1, \ldots, W_n)'$ are values of a causal zero-mean
ARMA($p, q$) process satisfying

$$\phi(B)W_t = \theta(B)Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2). \tag{6.6.3}$$

The model (6.6.1) arises naturally in trend estimation for time series data. For
example, the explanatory variables $x_{t1} = 1$, $x_{t2} = t$, and $x_{t3} = t^2$ can be used to
estimate a quadratic trend, and the variables $x_{t1} = 1$, $x_{t2} = \cos(\omega t)$, and $x_{t3} = \sin(\omega t)$
can be used to estimate a sinusoidal trend with frequency $\omega$. The columns of $X$ are
not necessarily simple functions of $t$ as in these two examples. Any specified column
of relevant variables, e.g., temperatures at times $t = 1, \ldots, n$, can be included in the
design matrix $X$, in which case the regression is conditional on the observed values of
the variables included in the matrix.

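The ordinary least squares (OLS) estimator of $\boldsymbol{\beta}$ is the value,
$\hat{\boldsymbol{\beta}}_{\mathrm{OLS}}$, which minimizes the sum of squares

$$(\mathbf{Y} - X\boldsymbol{\beta})'(\mathbf{Y} - X\boldsymbol{\beta}) = \sum_{t=1}^{n}\left(Y_t - \mathbf{x}_t'\boldsymbol{\beta}\right)^2.$$

Equating to zero the partial derivatives with respect to each component of $\boldsymbol{\beta}$ and
assuming (as we shall) that $X'X$ is nonsingular, we find that

$$\hat{\boldsymbol{\beta}}_{\mathrm{OLS}} = (X'X)^{-1}X'\mathbf{Y}. \tag{6.6.4}$$

(If $X'X$ is singular, $\hat{\boldsymbol{\beta}}_{\mathrm{OLS}}$ is not uniquely determined but still satisfies (6.6.4)
with $(X'X)^{-1}$ any generalized inverse of $X'X$.) The OLS estimate also maximizes the
likelihood of the observations when the errors $W_1, \ldots, W_n$ are iid and Gaussian. If
the design matrix $X$ is nonrandom, then even when the errors are non-Gaussian and
dependent, the OLS estimator is unbiased (i.e., $E(\hat{\boldsymbol{\beta}}_{\mathrm{OLS}}) = \boldsymbol{\beta}$) and its
covariance matrix is

$$\mathrm{Cov}(\hat{\boldsymbol{\beta}}_{\mathrm{OLS}}) = (X'X)^{-1}X'\Gamma_nX(X'X)^{-1}, \tag{6.6.5}$$

where $\Gamma_n = E(\mathbf{W}\mathbf{W}')$ is the covariance matrix of $\mathbf{W}$.

The generalized least squares (GLS) estimator of $\boldsymbol{\beta}$ is the value
$\hat{\boldsymbol{\beta}}_{\mathrm{GLS}}$ that minimizes the weighted sum of squares

$$(\mathbf{Y} - X\boldsymbol{\beta})'\Gamma_n^{-1}(\mathbf{Y} - X\boldsymbol{\beta}). \tag{6.6.6}$$

Differentiating partially with respect to each component of $\boldsymbol{\beta}$ and setting the
derivatives equal to zero, we find that

$$\hat{\boldsymbol{\beta}}_{\mathrm{GLS}} = \left(X'\Gamma_n^{-1}X\right)^{-1}X'\Gamma_n^{-1}\mathbf{Y}. \tag{6.6.7}$$

If the design matrix $X$ is nonrandom, the GLS estimator is unbiased and has covariance
matrix

$$\mathrm{Cov}(\hat{\boldsymbol{\beta}}_{\mathrm{GLS}}) = \left(X'\Gamma_n^{-1}X\right)^{-1}. \tag{6.6.8}$$

It can be shown that the GLS estimator is the best linear unbiased estimator of $\boldsymbol{\beta}$, i.e.,
for any $k$-dimensional vector $\mathbf{c}$ and for any unbiased estimator $\hat{\boldsymbol{\beta}}$ of $\boldsymbol{\beta}$ that is a
linear function of the observations $Y_1, \ldots, Y_n$,

$$\mathrm{Var}(\mathbf{c}'\hat{\boldsymbol{\beta}}_{\mathrm{GLS}}) \le \mathrm{Var}(\mathbf{c}'\hat{\boldsymbol{\beta}}).$$

In this sense the GLS estimator is therefore superior to the OLS estimator. However,
it can be computed only if $\boldsymbol{\phi}$ and $\boldsymbol{\theta}$ are known.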

Let $V(\boldsymbol{\phi}, \boldsymbol{\theta})$ denote the matrix $\sigma^{-2}\Gamma_n$ and let $T(\boldsymbol{\phi}, \boldsymbol{\theta})$ be any
square root of $V^{-1}$ (i.e., a matrix such that $T'T = V^{-1}$). Then we can multiply each side
of (6.6.2) by $T$ to obtain

$$T\mathbf{Y} = TX\boldsymbol{\beta} + T\mathbf{W}, \tag{6.6.9}$$

a regression equation with coefficient vector $\boldsymbol{\beta}$, data vector $T\mathbf{Y}$, design matrix $TX$, and
error vector $T\mathbf{W}$. Since the latter has uncorrelated, zero-mean components, each with
variance $\sigma^2$, the best linear estimator of $\boldsymbol{\beta}$ in terms of $T\mathbf{Y}$ (which is clearly the same
as the best linear estimator of $\boldsymbol{\beta}$ in terms of $\mathbf{Y}$, i.e., $\hat{\boldsymbol{\beta}}_{\mathrm{GLS}}$) can be obtained by
applying OLS estimation to the transformed regression equation (6.6.9). This gives

$$\hat{\boldsymbol{\beta}}_{\mathrm{GLS}} = (X'T'TX)^{-1}X'T'T\mathbf{Y}, \tag{6.6.10}$$

which is clearly the same as (6.6.7). Cochrane and Orcutt (1949) pointed out that if $\{W_t\}$
is an AR($p$) process satisfying

$$\phi(B)W_t = Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2),$$

then application of $\phi(B)$ to each side of the regression equations (6.6.1) transforms
them into regression equations with uncorrelated, zero-mean, constant-variance errors,
so that ordinary least squares can again be used to compute best linear unbiased
estimates of the components of $\boldsymbol{\beta}$ in terms of $Y_t^* = \phi(B)Y_t$, $t = p + 1, \ldots, n$. This
approach eliminates the need to compute the matrix $T$ but suffers from the drawback
that $\mathbf{Y}^*$ does not contain all the information in $\mathbf{Y}$. Cochrane and Orcutt's transformation
can be improved, and at the same time generalized to ARMA errors, as follows.

Instead of applying the operator $\phi(B)$ to each side of the regression equations
(6.6.1), we multiply each side of equation (6.6.2) by the matrix $T(\boldsymbol{\phi}, \boldsymbol{\theta})$ that maps
$\{W_t\}$ into the residuals [see (5.3.1)] of $\{W_t\}$ from the ARMA model (6.6.3). We have already
seen how to calculate these residuals using the innovations algorithm in Section 3.3.
To see that $T$ is a square root of the matrix $V^{-1}$, with $V$ as defined in the previous
paragraph, we simply recall that the residuals are uncorrelated with zero mean and
variance $\sigma^2$, so that

$$\mathrm{Cov}(T\mathbf{W}) = T\Gamma_nT' = \sigma^2I,$$

where $I$ is the $n \times n$ identity matrix. Hence

$$T'T = \sigma^2\Gamma_n^{-1} = V^{-1}.$$

GLS estimation of $\boldsymbol{\beta}$ can therefore be carried out by multiplying each side of (6.6.2) by
$T$ and applying ordinary least squares to the transformed regression model. It remains
only to compute $T\mathbf{Y}$ and $TX$.

Any data vector $\mathbf{d} = (d_1, \ldots, d_n)'$ can be left-multiplied by $T$ simply by reading it
into ITSM, entering the model (6.6.3), and pressing the green button labeled RES,
which plots the residuals. (The calculations are performed using the innovations
algorithm as described in Section 3.3.) The GLS estimator $\hat{\boldsymbol{\beta}}_{\mathrm{GLS}}$ is computed as
follows. The data vector $\mathbf{Y}$ is left-multiplied by $T$ to generate the transformed data
vector $\mathbf{Y}^*$, and each column of the design matrix $X$ is left-multiplied by $T$ to generate
the corresponding column of the transformed design matrix $X^*$. Then

$$\hat{\boldsymbol{\beta}}_{\mathrm{GLS}} = \left(X^{*\prime}X^*\right)^{-1}X^{*\prime}\mathbf{Y}^*. \tag{6.6.11}$$

The calculations of $\mathbf{Y}^*$, $X^*$, and hence of $\hat{\boldsymbol{\beta}}_{\mathrm{GLS}}$, are all carried out by the program ITSM
in the option Regression>Estimation>Generalized LS.
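The transformation by $T$ can be illustrated in the simplest case: an intercept-only regression whose errors follow an MA(1) model with known coefficient $\theta$. This is a minimal sketch under those assumptions, with function names of our own choosing; it is not the ITSM implementation. For an MA(1) process the innovations algorithm reduces to the scalar recursions $r_0 = 1 + \theta^2$, $\theta_{t,1} = \theta/r_{t-1}$, $r_t = 1 + \theta^2 - \theta^2/r_{t-1}$, and left-multiplication by $T$ amounts to replacing each data value by its rescaled one-step prediction error.

```python
def whiten_ma1(d, theta):
    """Left-multiply d by T for MA(1) errors: (d_t - dhat_t) / sqrt(r_{t-1})."""
    out = []
    r = 1.0 + theta ** 2       # r_0
    pred = 0.0                 # one-step predictor of the first value
    for value in d:
        innov = value - pred
        out.append(innov / r ** 0.5)
        pred = (theta / r) * innov               # theta_{t,1} * innovation
        r = 1.0 + theta ** 2 - theta ** 2 / r    # r_t
    return out


def gls_intercept(y, theta):
    """GLS estimate of beta in Y_t = beta + W_t with MA(1) errors, via (6.6.11).

    Returns (beta_hat, sxx), where sigma^2 / sxx is the variance in (6.6.8).
    """
    y_star = whiten_ma1(y, theta)                # T Y
    x_star = whiten_ma1([1.0] * len(y), theta)   # T X, with X a column of ones
    sxx = sum(v * v for v in x_star)
    beta_hat = sum(a * b for a, b in zip(x_star, y_star)) / sxx
    return beta_hat, sxx


# With theta = 0 the errors are white noise and GLS reduces to the sample mean.
print(gls_intercept([3.0, 5.0, 4.0, 6.0], 0.0))
```

For $n = 2$ the quantity `sxx` equals $\mathbf{1}'V^{-1}\mathbf{1} = 2/(1 + \theta^2 + \theta)$, which can be verified by inverting the $2 \times 2$ matrix $V$ by hand.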


6.6.2  ML  Estimation 


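If (as is usually the case) the parameters of the ARMA($p, q$) model for the errors
are unknown, they can be estimated together with the regression coefficients by
maximizing the Gaussian likelihood

$$L(\boldsymbol{\beta}, \boldsymbol{\phi}, \boldsymbol{\theta}, \sigma^2) = (2\pi)^{-n/2}(\det\Gamma_n)^{-1/2}\exp\left\{-\tfrac{1}{2}(\mathbf{Y} - X\boldsymbol{\beta})'\Gamma_n^{-1}(\mathbf{Y} - X\boldsymbol{\beta})\right\},$$

where $\Gamma_n(\boldsymbol{\phi}, \boldsymbol{\theta}, \sigma^2)$ is the covariance matrix of $\mathbf{W} = \mathbf{Y} - X\boldsymbol{\beta}$. Since $\{W_t\}$ is an
ARMA($p, q$) process with parameters $(\boldsymbol{\phi}, \boldsymbol{\theta}, \sigma^2)$, the maximum likelihood estimators
$\hat{\boldsymbol{\beta}}$, $\hat{\boldsymbol{\phi}}$, and $\hat{\boldsymbol{\theta}}$ are found (as in Section 5.2) by minimizing

$$\ell(\boldsymbol{\beta}, \boldsymbol{\phi}, \boldsymbol{\theta}) = \ln\left(n^{-1}S(\boldsymbol{\beta}, \boldsymbol{\phi}, \boldsymbol{\theta})\right) + n^{-1}\sum_{t=1}^{n}\ln r_{t-1}, \tag{6.6.12}$$

where

$$S(\boldsymbol{\beta}, \boldsymbol{\phi}, \boldsymbol{\theta}) = \sum_{t=1}^{n}\left(W_t - \hat{W}_t\right)^2/r_{t-1},$$

$\hat{W}_t$ is the best one-step predictor of $W_t$, and $r_{t-1}\sigma^2$ is its mean squared error. The
function $\ell(\boldsymbol{\beta}, \boldsymbol{\phi}, \boldsymbol{\theta})$ can be expressed in terms of the observations $\{Y_t\}$ and the
parameters $\boldsymbol{\beta}$, $\boldsymbol{\phi}$, and $\boldsymbol{\theta}$ using the innovations algorithm (see Section 3.3) and minimized
numerically to give the maximum likelihood estimators, $\hat{\boldsymbol{\beta}}$, $\hat{\boldsymbol{\phi}}$, and $\hat{\boldsymbol{\theta}}$. The maximum
likelihood estimator of $\sigma^2$ is then given, as in Section 5.2, by
$\hat\sigma^2 = S(\hat{\boldsymbol{\beta}}, \hat{\boldsymbol{\phi}}, \hat{\boldsymbol{\theta}})/n$.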

An extension of an iterative scheme, proposed by Cochrane and Orcutt (1949) for
the case $q = 0$, simplifies the minimization considerably. It is based on the observation
that for fixed $\boldsymbol{\phi}$ and $\boldsymbol{\theta}$, the value of $\boldsymbol{\beta}$ that minimizes $\ell(\boldsymbol{\beta}, \boldsymbol{\phi}, \boldsymbol{\theta})$ is
$\hat{\boldsymbol{\beta}}_{\mathrm{GLS}}(\boldsymbol{\phi}, \boldsymbol{\theta})$, which can be computed algebraically from (6.6.11) instead of by
searching numerically for the minimizing value. The scheme is as follows.

(i) Compute $\hat{\boldsymbol{\beta}}_{\mathrm{OLS}}$ and the estimated residuals $Y_t - \mathbf{x}_t'\hat{\boldsymbol{\beta}}_{\mathrm{OLS}}$, $t = 1, \ldots, n$.

(ii) Fit an ARMA($p, q$) model by maximum Gaussian likelihood to the estimated
residuals.

(iii) For the fitted ARMA model compute the corresponding estimator $\hat{\boldsymbol{\beta}}_{\mathrm{GLS}}$ from
(6.6.11).

(iv) Compute the residuals $Y_t - \mathbf{x}_t'\hat{\boldsymbol{\beta}}_{\mathrm{GLS}}$, $t = 1, \ldots, n$, and return to (ii), stopping
when the estimators have stabilized.
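Steps (i)-(iv) can be sketched for an intercept-only regression with MA(1) errors. Everything in this sketch is an illustrative assumption: the data are simulated, the function names are ours, and the maximum likelihood fit in step (ii) is replaced by a crude grid search over $\theta$ that minimizes the reduced likelihood (6.6.12) with $\boldsymbol{\beta}$ held fixed. A practical implementation would use a numerical optimizer instead.

```python
import math
import random


def innovations_ma1(w, theta):
    """One-step prediction errors w_t - what_t and the ratios r_{t-1} for an MA(1)."""
    innov, rs = [], []
    r, pred = 1.0 + theta ** 2, 0.0
    for value in w:
        e = value - pred
        innov.append(e)
        rs.append(r)
        pred = (theta / r) * e
        r = 1.0 + theta ** 2 - theta ** 2 / r
    return innov, rs


def reduced_likelihood(w, theta):
    """ln(S/n) + (1/n) * sum ln r_{t-1}, i.e. (6.6.12) with beta held fixed."""
    innov, rs = innovations_ma1(w, theta)
    n = len(w)
    s = sum(e * e / r for e, r in zip(innov, rs))
    return math.log(s / n) + sum(math.log(r) for r in rs) / n


def fit_ma1(w, grid_size=199):
    """Grid search over theta in [-0.99, 0.99]; a stand-in for a real optimizer."""
    grid = [-0.99 + 1.98 * k / (grid_size - 1) for k in range(grid_size)]
    return min(grid, key=lambda th: reduced_likelihood(w, th))


def gls_intercept(y, theta):
    """GLS estimate of the intercept, as in (6.6.11) with X a column of ones."""
    ones_innov, rs = innovations_ma1([1.0] * len(y), theta)
    y_innov, _ = innovations_ma1(y, theta)
    num = sum(a * b / r for a, b, r in zip(ones_innov, y_innov, rs))
    den = sum(a * a / r for a, r in zip(ones_innov, rs))
    return num / den


def iterate_scheme(y, max_iter=20, tol=1e-8):
    beta = sum(y) / len(y)            # step (i): OLS for an intercept-only model
    theta = 0.0
    for _ in range(max_iter):
        resid = [v - beta for v in y]             # current estimated residuals
        new_theta = fit_ma1(resid)                # step (ii)
        new_beta = gls_intercept(y, new_theta)    # step (iii)
        stable = abs(new_beta - beta) < tol and new_theta == theta
        beta, theta = new_beta, new_theta         # step (iv): repeat from (ii)
        if stable:
            break
    return beta, theta


# Simulated data (hypothetical): Y_t = -4 + Z_t - 0.8 Z_{t-1}, n = 200.
random.seed(0)
z = [random.gauss(0.0, 1.0) for _ in range(201)]
y = [-4.0 + z[t] - 0.8 * z[t - 1] for t in range(1, 201)]
beta_hat, theta_hat = iterate_scheme(y)
print(beta_hat, theta_hat)
```

The loop stops when neither estimate changes between passes, which mirrors the stopping rule in step (iv).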

If $\{W_t\}$ is a causal and invertible ARMA process, then under mild conditions on
the explanatory variables $\mathbf{x}_t$, the maximum likelihood estimates are asymptotically
multivariate normal (see Fuller 1976). In addition, the estimated regression coefficients
are asymptotically independent of the estimated ARMA parameters.

The large-sample covariance matrix of the ARMA parameter estimators, suitably
normalized, has a complicated form that involves both the regression variables $\mathbf{x}_t$ and
the covariance function of $\{W_t\}$. It is therefore convenient to estimate the covariance
matrix as $-H^{-1}$, where $H$ is the Hessian matrix of the observed log-likelihood
evaluated at its maximum.

The OLS, GLS, and maximum likelihood estimators of the regression coefficients
all have the same asymptotic covariance matrix, so in this sense the dependence does
not play a major role. However, the asymptotic covariance of both the OLS and GLS
estimators can be very inaccurate if the appropriate covariance matrix $\Gamma_n$ is not used in
the expressions (6.6.5) and (6.6.8). This point is illustrated in the following examples.

Remark  1.  The  use  of  the  innovations  algorithm  for  GLS  and  ML  estimation  extends 
to  regression  with  ARIMA  errors  (see  Example  6.6.3  below)  and  FARIMA  errors 
(FARIMA  processes  are  defined  in  Section  10.5).  □ 

Example  6.6.1  The  Overshort  Data 

The analysis of the overshort data in Example 3.2.8 suggested the model

$$Y_t = \beta + W_t,$$

where $-\beta$ is interpreted as the daily leakage from the underground storage tank and
$\{W_t\}$ is the MA(1) process

$$W_t = Z_t + \theta Z_{t-1}, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2).$$

(Here $k = 1$ and $x_{t1} = 1$.) The OLS estimate of $\beta$ is simply the sample mean
$\hat\beta_{\mathrm{OLS}} = \bar{Y}_n = -4.035$. Under the assumption that $\{W_t\}$ is iid noise, the estimated variance
of the OLS estimator of $\beta$ is $\hat\gamma_Y(0)/57 = 59.92$. However, since this estimate of the
variance fails to take dependence into account, it is not reliable.

To find maximum Gaussian likelihood estimates of $\beta$ and the parameters
of $\{W_t\}$ using ITSM, open the file OSHORTS.TSM, select the option
Regression>Specify and check the box marked Include intercept
term only. Then press the blue GLS button and you will see the estimated value
of $\beta$. (This is in fact the same as the OLS estimator since the default model in ITSM
is WN(0,1).) Then select Model>Estimation>Autofit and press Start. The
autofit option selects the minimum AICC model for the residuals,

$$W_t = Z_t - 0.818 Z_{t-1}, \quad \{Z_t\} \sim \mathrm{WN}(0, 2041), \tag{6.6.13}$$

and displays the estimated MA coefficient $\hat\theta = -0.818$ and the corresponding GLS
estimate $\hat\beta_{\mathrm{GLS}} = -4.745$, with a standard error of 1.188, in the Regression
estimates window. (If we reestimate the variance of the OLS estimator, using
(6.6.5) with $\hat\Gamma_{57}$ computed from the model (6.6.13), we obtain the value 2.214, a drastic
reduction from the value 59.92 obtained when dependence is ignored. For a positively
correlated time series, ignoring the dependence would lead to underestimation of the
variance.)
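The reestimated OLS variance quoted above can be checked by hand. With a single regressor equal to 1, the expression (6.6.5) reduces to the sum of all entries of $\Gamma_n$ divided by $n^2$. The following Python sketch evaluates this for the fitted MA(1) model (6.6.13); the function names are our own illustrative choices and are not part of ITSM.

```python
def ma1_autocovariance(theta, sigma2, lag):
    """Autocovariance of W_t = Z_t + theta*Z_{t-1}, {Z_t} ~ WN(0, sigma2)."""
    lag = abs(lag)
    if lag == 0:
        return sigma2 * (1.0 + theta ** 2)
    if lag == 1:
        return sigma2 * theta
    return 0.0


def variance_of_sample_mean(n, acvf):
    """Var of the sample mean = (1/n^2) * sum over i, j of gamma(i - j).

    With X a column of ones this is what (X'X)^{-1} X' Gamma_n X (X'X)^{-1}
    reduces to.
    """
    total = sum(acvf(i - j) for i in range(n) for j in range(n))
    return total / n ** 2


n = 57
theta_hat, sigma2_hat = -0.818, 2041.0  # fitted model (6.6.13)

var_dependent = variance_of_sample_mean(
    n, lambda h: ma1_autocovariance(theta_hat, sigma2_hat, h)
)
print(round(var_dependent, 3))
```

Because the fitted lag-one autocovariance is negative, the off-diagonal entries of $\hat\Gamma_{57}$ cancel most of the diagonal contribution, which is why the variance falls so far below the value obtained under independence.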

Pressing  the  blue  MLE  button  will  reestimate  the  MA  parameters  using  the 
residuals  from  the  updated  regression  and  at  the  same  time  reestimate  the  regression 
coefficient,  printing  the  new  parameters  in  the  Regression  estimates  window. 
After  this  operation  has  been  repeated  several  times,  the  parameters  will  stabilize,  as 
shown in Table 6.2. Estimated 95 % confidence bounds for $\beta$ using the GLS estimate
are $-4.75 \pm 1.96(1.408)^{1/2} = (-7.07, -2.43)$, strongly suggesting that the storage
tank  has  a  leak.  Such  a  conclusion  would  not  have  been  reached  without  taking  into 
account  the  dependence  in  the  data. 

□ 


Table 6.2 Estimates of $\beta$ and $\theta_1$ for the overshort data of Example 6.6.1

Iteration $i$    $\hat\theta^{(i)}$    $\hat\beta^{(i)}$
0                0                     -4.035
1                -0.818                -4.745
2                -0.848                -4.780
3                -0.848                -4.780

Example  6.6.2  The  Lake  Data 

In  Examples  5.2.4  and  5.5.2  we  found  maximum  likelihood  ARMA(1,1)  and  AR(2) 
models  for  the  mean-corrected  lake  data.  Now  let  us  consider  fitting  a  linear  trend  to 
the  data  with  AR(2)  noise.  The  choice  of  an  AR(2)  model  was  suggested  by  an  analysis 
of  the  residuals  obtained  after  removing  a  linear  trend  from  the  data  using  OLS.  Our 
model  now  takes  the  form 

$$Y_t = \beta_0 + \beta_1 t + W_t,$$

where $\{W_t\}$ is the AR(2) process satisfying

$$W_t = \phi_1 W_{t-1} + \phi_2 W_{t-2} + Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2).$$

From Example 1.3.5, we find that the OLS estimate of $\beta$ is $\hat\beta_{\mathrm{OLS}} = (10.202, -0.0242)'$.
If we ignore the correlation structure of the noise, the estimated covariance matrix $\Gamma_n$
of $W$ is $\hat\gamma_Y(0) I$ (where $I$ is the identity matrix). The corresponding estimated covariance
matrix of $\hat\beta_{\mathrm{OLS}}$ is (from (6.6.5))

$$\hat\gamma_Y(0)\,(X'X)^{-1} = \hat\gamma_Y(0) \begin{bmatrix} n & \sum_{t=1}^n t \\ \sum_{t=1}^n t & \sum_{t=1}^n t^2 \end{bmatrix}^{-1} = \begin{bmatrix} 0.07203 & -0.00110 \\ -0.00110 & 0.00002 \end{bmatrix}. \tag{6.6.14}$$

However, the estimated model for the noise process, found by fitting an AR(2) model
to the residuals $Y_t - \hat\beta'_{\mathrm{OLS}} x_t$, is

$$W_t = 1.008 W_{t-1} - 0.295 W_{t-2} + Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, 0.4571).$$

Table 6.3 Estimates of $\beta$ and $\phi$ for the lake data after 3 iterations

Iteration $i$    $\hat\phi_1^{(i)}$    $\hat\phi_2^{(i)}$    $\hat\beta_0^{(i)}$    $\hat\beta_1^{(i)}$
0                0                     0                     10.20                  -0.0242
1                1.008                 -0.295                10.09                  -0.0216
2                1.005                 -0.291                10.09                  -0.0216

Assuming that this is the true model for $\{W_t\}$, the GLS estimate is found to be
$(10.091, -0.0216)'$, in close agreement with the OLS estimate. The estimated covariance
matrices for the OLS and GLS estimates are given by

$$\begin{bmatrix} 0.22177 & -0.00335 \\ -0.00335 & 0.00007 \end{bmatrix} \quad \text{and} \quad \begin{bmatrix} 0.21392 & -0.00321 \\ -0.00321 & 0.00006 \end{bmatrix},$$

respectively. Notice how the estimated variances of the OLS and GLS estimators are nearly three
times the magnitude of the corresponding variance estimates of the OLS calculated
under the independence assumption [see (6.6.14)]. Estimated 95 % confidence bounds
for the slope using the GLS estimate are $-0.0216 \pm 1.96(0.00006)^{1/2} = -0.0216 \pm 0.0152$,
indicating a significant decreasing trend in the level of Lake Huron during the
years 1875-1972.
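The OLS covariance matrix quoted above comes from (6.6.5) with $\Gamma_n$ computed from the fitted AR(2) noise model. The sketch below, in plain Python, shows one way such a calculation could be organized for the regressors $x_t = (1, t)'$, $t = 1, \ldots, 98$. It is an illustration under our own assumptions (model autocovariances from the Yule-Walker recursions, hypothetical function names), not the ITSM implementation, so the result should be close to, but need not match to every digit, the matrix given in the text.

```python
def ar2_autocovariances(phi1, phi2, sigma2, max_lag):
    """Autocovariances gamma(0), ..., gamma(max_lag) of a causal AR(2) process."""
    rho1 = phi1 / (1.0 - phi2)
    rho2 = phi1 * rho1 + phi2
    gamma0 = sigma2 / (1.0 - phi1 * rho1 - phi2 * rho2)
    gamma = [gamma0, rho1 * gamma0]
    for h in range(2, max_lag + 1):
        gamma.append(phi1 * gamma[h - 1] + phi2 * gamma[h - 2])
    return gamma


def ols_covariance_linear_trend(n, gamma):
    """(X'X)^{-1} X' Gamma_n X (X'X)^{-1} for x_t = (1, t)', t = 1, ..., n."""
    ts = range(1, n + 1)
    s0, s1, s2 = float(n), float(sum(ts)), float(sum(t * t for t in ts))
    det = s0 * s2 - s1 * s1
    inv = [[s2 / det, -s1 / det], [-s1 / det, s0 / det]]  # (X'X)^{-1}

    # m = X' Gamma_n X, accumulated entry by entry.
    m = [[0.0, 0.0], [0.0, 0.0]]
    for s in ts:
        for t in ts:
            g = gamma[abs(s - t)]
            m[0][0] += g
            m[0][1] += g * t
            m[1][0] += g * s
            m[1][1] += g * s * t

    def matmul(a, b):
        return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
                for i in range(2)]

    return matmul(matmul(inv, m), inv)


n = 98
gamma = ar2_autocovariances(1.008, -0.295, 0.4571, n - 1)
cov_ols = ols_covariance_linear_trend(n, gamma)
for row in cov_ols:
    print([round(v, 5) for v in row])
```

The GLS covariance matrix $(X'\Gamma_n^{-1}X)^{-1}$ could be obtained in the same way once $\Gamma_n$ is inverted; that step is omitted here to keep the sketch short.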

The  iterative  procedure  described  above  was  used  to  produce  maximum  likelihood 
estimates  of  the  parameters.  The  calculations  using  ITSM  are  analogous  to  those 
in  Example  6.6.1.  The  results  from  each  iteration  are  summarized  in  Table  6.3. 
As  in  Example  6.6.1,  the  convergence  of  the  estimates  is  very  rapid. 

□ 


Example  6.6.3  Seat-Belt  Legislation;  SBL.TSM 

Figure  6-18  shows  the  numbers  of  monthly  deaths  and  serious  injuries  Yt ,  t  = 
1,  . . . ,  120,  on  UK  roads  for  10  years  beginning  in  January  1975.  They  are  filed 
as  SBL.TSM.  Seat-belt  legislation  was  introduced  in  February  1983  in  the  hope  of 
reducing  the  mean  number  of  monthly  “deaths  and  serious  injuries,”  (from  t  =  99 
onwards).  In  order  to  study  whether  or  not  there  was  a  drop  in  mean  from  that  time 
onwards,  we  consider  the  regression, 

$$Y_t = a + b f(t) + W_t, \quad t = 1, \ldots, 120, \tag{6.6.15}$$

where $f(t) = 0$ for $1 \le t \le 98$, and $f(t) = 1$ for $t \ge 99$. The seat-belt legislation
will  be  considered  effective  if  the  estimated  value  of  the  regression  coefficient  b 
is  significantly  negative.  This  problem  also  falls  under  the  heading  of  intervention 
analysis  (see  Section  11.2). 

Figure 6-18 Monthly deaths and serious injuries $\{Y_t\}$ on UK roads, January 1975-December 1984

OLS regression based on the model (6.6.15) suggests that the error sequence $\{W_t\}$
is highly correlated with a strong seasonal component of period 12. (To do the regression
using ITSM open the file SBL.TSM, select Regression>Specify, check
only Include intercept term and Include auxiliary variables,
press the Browse button, select the file SBLIN.TSM, which contains the
function $f(t)$ of (6.6.15), and enter 1 for the number of columns. Then select the
option Regression>Estimation>Generalized LS. The estimates of the
coefficients $a$ and $b$ are displayed in the Regression estimates window, and
the data become the estimates of the residuals $\{W_t\}$.) The graphs of the data and
sample ACF clearly suggest a strong seasonal component with period 12. In order to
transform the model (6.6.15) into one with stationary residuals, we therefore consider
the differenced data $X_t = Y_t - Y_{t-12}$, which satisfy

$$X_t = b g_t + N_t, \quad t = 13, \ldots, 120, \tag{6.6.16}$$

where $g_t = f(t) - f(t-12)$, so that $g_t = 1$ for $99 \le t \le 110$ and $g_t = 0$ otherwise, and
$\{N_t = W_t - W_{t-12}\}$ is a stationary sequence to be represented by a suitably chosen
ARMA model. The series $\{X_t\}$ is contained in the file SBLD.TSM, and the function
$g_t$ is contained in the file SBLDIN.TSM.
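The form of the differenced regressor follows directly from lag-12 differencing of the step function in (6.6.15). A minimal Python check (the names `step` and `g` are our own and are not the contents of SBLIN.TSM or SBLDIN.TSM):

```python
def step(t, change_point=99):
    """f(t) of (6.6.15): 0 before the change point, 1 from it onwards."""
    return 1 if t >= change_point else 0


lag = 12
# g_t = f(t) - f(t - 12), defined for t = 13, ..., 120.
g = {t: step(t) - step(t - lag) for t in range(lag + 1, 121)}

nonzero = [t for t, value in g.items() if value != 0]
print(nonzero[0], nonzero[-1], len(nonzero))  # 99 110 12
```

Differencing therefore turns the permanent level shift in $\{Y_t\}$ into a pulse of length 12 in $\{X_t\}$, and the coefficient $b$ is unchanged.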

The next step is to perform ordinary least squares regression of $X_t$ on $g_t$ following
steps analogous to those of the previous paragraph (but this time checking only the
box marked Include auxiliary variables in the Regression Trend
Function dialog box) and again using the option Regression>Estimation>Generalized LS
or pressing the blue GLS button. The model

$$X_t = -346.92\, g_t + N_t, \tag{6.6.17}$$

is then displayed in the Regression estimates window together with the
assumed noise model (white noise in this case). Inspection of the sample ACF
of the residuals suggests an MA(13) or AR(13) model for $\{N_t\}$. Fitting AR
and MA models of order up to 13 (with no mean-correction) using the option
Model>Estimation>Autofit gives an MA(12) model as the minimum AICC
fit for the residuals. Once this model has been fitted, the model in the Regression
estimates window is automatically updated to

$$X_t = -328.45\, g_t + N_t, \tag{6.6.18}$$

with the fitted MA(12) model for the residuals also displayed. After several iterations
(each iteration is performed by pressing the MLE button) we arrive at the model

$$X_t = -328.45\, g_t + N_t, \tag{6.6.19}$$

with

$$\begin{aligned} N_t = {} & Z_t + 0.219 Z_{t-1} + 0.098 Z_{t-2} + 0.031 Z_{t-3} + 0.064 Z_{t-4} + 0.069 Z_{t-5} + 0.111 Z_{t-6} \\ & + 0.081 Z_{t-7} + 0.057 Z_{t-8} + 0.092 Z_{t-9} - 0.028 Z_{t-10} + 0.183 Z_{t-11} - 0.627 Z_{t-12}, \end{aligned}$$

where $\{Z_t\} \sim \mathrm{WN}(0, 12{,}581)$. The estimated standard deviation of the regression
coefficient estimator is 49.41, so the estimated coefficient, $-328.45$, is very significantly
negative, indicating the effectiveness of the legislation. The differenced data are
shown in Figure 6-19 with the fitted regression function.

□

Figure 6-19 The differenced deaths and serious injuries on UK roads $\{X_t = Y_t - Y_{t-12}\}$, showing the fitted GLS regression line


Problems

6.1 Suppose that $\{X_t\}$ is an ARIMA($p, d, q$) process satisfying the difference
equations

$$\phi(B)(1 - B)^d X_t = \theta(B) Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2).$$

Show that these difference equations are also satisfied by the process $W_t = X_t +
A_0 + A_1 t + \cdots + A_{d-1} t^{d-1}$, where $A_0, \ldots, A_{d-1}$ are arbitrary random variables.

6.2  Verify  the  representation  given  in  (6.3.4). 

6.3  Test  the  data  in  Example  6.3.1  for  the  presence  of  a  unit  root  in  an  AR(2)  model 
using  the  augmented  Dickey-Fuller  test. 

6.4  Apply  the  augmented  Dickey-Fuller  test  to  the  levels  of  Lake  Huron  data 
(LAKE.TSM).  Perform  two  analyses  assuming  AR(1)  and  AR(2)  models. 


6.5 If $\{Y_t\}$ is a causal ARMA process (with zero mean) and if $X_0$ is a random
variable with finite second moment such that $X_0$ is uncorrelated with $Y_t$ for each
$t = 1, 2, \ldots$, show that the best linear predictor of $Y_{n+1}$ in terms of
$1, X_0, Y_1, \ldots, Y_n$ is the same as the best linear predictor of $Y_{n+1}$ in terms of
$1, Y_1, \ldots, Y_n$.

6.6 Let $\{X_t\}$ be the ARIMA(2,1,0) process satisfying

$$(1 - 0.8B + 0.25B^2)\nabla X_t = Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, 1).$$

(a) Determine the forecast function $g(h) = P_n X_{n+h}$ for $h > 0$.

(b) Assuming that $n$ is large, compute $\sigma_n^2(h)$ for $h = 1, \ldots, 5$.

6.7  Use  a  text  editor  to  create  a  new  data  set  ASHORT.TSM  that  consists  of  the  data 
in  AIRPASS.TSM  with  the  last  12  values  deleted.  Use  ITSM  to  find  an  ARIMA 
model  for  the  logarithms  of  the  data  in  ASHORT.TSM.  Your  analysis  should 
include 

(a)  a  logical  explanation  of  the  steps  taken  to  find  the  chosen  model, 

(b) approximate 95 % bounds for the components of $\phi$ and $\theta$,

(c)  an  examination  of  the  residuals  to  check  for  whiteness  as  described  in 
Section  1.6, 

(d)  a  graph  of  the  series  ASHORT.TSM  showing  forecasts  of  the  next  12  values 
and  95  %  prediction  bounds  for  the  forecasts, 

(e)  numerical  values  for  the  12-step  ahead  forecast  and  the  corresponding  95  % 
prediction  bounds, 

(f) a table of the actual forecast errors, i.e., the true value (deleted from
AIRPASS.TSM) minus the forecast value, for each of the 12 forecasts.

Does the last value of AIRPASS.TSM lie within the corresponding 95 % prediction
bounds?

6.8 Repeat Problem 6.7, but instead of differencing, apply the classical decomposition
method to the logarithms of the data in ASHORT.TSM by deseasonalizing,
subtracting  a  quadratic  trend,  and  then  finding  an  appropriate  ARMA  model 
for  the  residuals.  Compare  the  12  forecast  errors  found  from  this  approach  with 
those  found  in  Problem  6.7. 

6.9  Repeat  Problem  6.7  for  the  series  BEER.TSM,  deleting  the  last  12  values 
to  create  a  file  named  BSHORT.TSM. 

6.10  Repeat  Problem  6.8  for  the  series  BEER.TSM  and  the  shortened  series 
BSHORT.TSM. 

6.11 A time series $\{X_t\}$ is differenced at lag 12, then at lag 1 to produce a zero-mean
series $\{Y_t\}$ with the following sample ACF:

$$\hat\rho(12j) \approx (0.8)^{|j|}, \quad j = 0, \pm 1, \pm 2, \ldots,$$

$$\hat\rho(12j \pm 1) \approx (0.4)(0.8)^{|j|}, \quad j = 0, \pm 1, \pm 2, \ldots,$$

$$\hat\rho(h) \approx 0, \quad \text{otherwise},$$

and $\hat\gamma(0) = 25$.

(a) Suggest a SARIMA model for $\{X_t\}$ specifying all parameters.

(b) For large $n$, express the one- and twelve-step linear predictors $P_n X_{n+1}$ and
$P_n X_{n+12}$ in terms of $X_t$, $t = -12, -11, \ldots, n$, and $Y_t - \hat Y_t$, $t = 1, \ldots, n$.

(c) Find the mean squared errors of the predictors in (b).

6.12 Use ITSM to verify the calculations of Examples 6.6.1-6.6.3.

6.13 The file TUNDRA.TSM contains the average maximum temperature over the
month of February for the years 1895-1993 in an area of the USA whose
vegetation is characterized as tundra.

(a) Fit a straight line to the data using OLS. Is the slope of the line significantly
different from zero?

(b) Find an appropriate ARMA model for the residuals from the OLS fit in (a).

(c) Calculate the maximum likelihood estimates of the intercept and the slope of the line and
the ARMA parameters in (a). Is the slope of the line significantly different
from zero?

(d)  Use  your  model  to  forecast  the  average  maximum  temperature  for  the  years 
1994-2004. 


Time  Series  Models 
for  Financial  Data 


7.1  Historical  Overview 

7.2  GARCH  Models 

7.3  Modified  GARCH  Processes 

7.4  Stochastic  Volatility  Models 

7.5  Continuous-Time  Models 

7.6  An  Introduction  to  Option  Pricing 


In this chapter we discuss some of the time series models which have been found useful
in the analysis of financial data. These include both discrete-time and continuous-time
models, the latter being used widely, following the celebrated work of Black,
Merton and Scholes, for the pricing of stock options. The closing price on trading
day $t$, say $P_t$, of a particular stock or stock-price index, typically appears to be
non-stationary while the log asset price, $X_t := \log(P_t)$, has observed sample-paths like
those of a random walk with stationary uncorrelated increments, i.e., the differenced
log asset price, $Z_t := X_t - X_{t-1}$, known as the log return (or simply return) for
day $t$, has sample-paths resembling those of white noise. Although the sequence $\{Z_t\}$
appears to be white noise, there is strong evidence to suggest that it is not independent
white noise. Much of the analysis of financial time series is devoted to representing
and exploiting this dependence, which is not visible in the sample autocorrelation
function of $\{Z_t\}$. The continuous-time analogue of a random walk with independent
and identically distributed increments is known as a Lévy process, the most familiar
examples of which are the Poisson process and Brownian motion. Lévy processes
play a key role in the continuous-time modeling of financial data, both as models
for the log asset price itself and as building blocks for more complex models. We
give a brief introduction to these processes and some of the continuous-time models
constructed from them. Finally we consider the pricing of European stock options
using the geometric Brownian motion model for stock prices, a model which, in spite
of its limitations, has been found useful in practice.


7.1 Historical Overview

For  more  than  30  years  now,  discrete-time  models  (including  stochastic  volatility, 
ARCH,  GARCH  and  their  many  generalizations)  have  been  developed  to  reflect  the 
so-called  stylized  features  of  financial  time  series.  These  properties,  which  include  tail 
heaviness,  asymmetry,  volatility  clustering  and  serial  dependence  without  correlation, 
cannot  be  captured  with  traditional  linear  time  series  models  such  as  the  ARMA 
models considered earlier in this book. If $P_t$ denotes the price of a stock or other
financial asset at time $t$, $t \in \mathbb{Z}$, then the series of log returns, $\{Z_t := \log P_t - \log P_{t-1}\}$,
is typically modeled as a stationary time series. An ARMA model for the series $\{Z_t\}$
would have the property that the conditional variance $h_t$ of $Z_t$ given $\{Z_s, s < t\}$ is
independent of $t$ and of $\{Z_s, s < t\}$. However, even a cursory inspection of most
empirical log return series (see, e.g., Figure 7-4) strongly suggests that this is not
the case in practice. The fundamental idea of the ARCH (autoregressive conditional
heteroscedasticity) model (Engle 1982) is to incorporate the sequence $\{h_t\}$ into the
model by postulating that

$$Z_t = \sqrt{h_t}\, e_t, \quad \text{where } \{e_t\} \sim \mathrm{IID\ N}(0, 1),$$

and $h_t$ (known as the volatility) is related to the past values of $Z_t^2$ via a relation of the
form

$$h_t = \alpha_0 + \sum_{i=1}^{p} \alpha_i Z_{t-i}^2,$$

for some positive integer $p$, where $\alpha_0 > 0$ and $\alpha_i \ge 0$, $i = 1, \ldots, p$. The GARCH
(generalized ARCH) model of Bollerslev (1986) postulates a more general relation,

$$h_t = \alpha_0 + \sum_{i=1}^{p} \alpha_i Z_{t-i}^2 + \sum_{i=1}^{q} \beta_i h_{t-i},$$

with $\alpha_0 > 0$, $\alpha_i \ge 0$, $i = 1, \ldots, p$, and $\beta_i \ge 0$, $i = 1, \ldots, q$. These models have
been studied intensively since their introduction and a variety of parameter estimation
techniques have been developed. They will be discussed in Section 7.2 and some of
their extensions in Section 7.3.

An alternative approach to modeling the changing variability of log returns, due
to Taylor (1982), is to suppose that $Z_t = \sqrt{h_t}\, e_t$, where $\{e_t\} \sim \mathrm{IID}(0, 1)$ and the
volatility sequence $\{h_t\}$ is independent of $\{e_t\}$. (Taylor originally allowed $\{e_t\}$ to be
an autoregression, but it is now customary to use the more restrictive definition just
given.) A critical difference from the ARCH and GARCH models is the fact that the
conditional distribution of $h_t$ given $\{h_s, s < t\}$ is independent of $\{e_s, s < t\}$. A widely
used special case of this model is the so-called log-normal stochastic volatility (SV)
model in which $\{e_t\} \sim \mathrm{IID\ N}(0, 1)$, $\ln h_t = \gamma_0 + \gamma_1 \ln h_{t-1} + \eta_t$, $\{\eta_t\} \sim \mathrm{IID\ N}(0, \sigma^2)$
and $\{\eta_t\}$ and $\{e_t\}$ are independent. We shall discuss this model in Section 7.4.

Continuous-time models for financial time series have a long history, going back
at least to Bachelier (1900), who used Brownian motion to represent the prices
$\{P(t), t \ge 0\}$ of a stock in the Paris stock exchange. This model had the unfortunate
feature of permitting negative stock prices, a shortcoming which was eliminated in
the geometric Brownian motion model of Samuelson (1965), according to which $P(t)$
satisfies an Itô stochastic differential equation of the form

$$dP(t) = \mu P(t)\, dt + \sigma P(t)\, dB(t),$$

where $\mu \in \mathbb{R}$, $\sigma > 0$ and $B$ is standard Brownian motion. For any fixed positive value
of $P(0)$ the solution (see Section 7.5.2 and Appendix D.4) is

$$P(t) = P(0) \exp\left[(\mu - \sigma^2/2)t + \sigma B(t)\right], \quad t \ge 0,$$

so that the log asset price, $X(t) := \log P(t)$, is Brownian motion and the log return over
the time-interval $(t, t + \Delta)$ is

$$X(t + \Delta) - X(t) = \left(\mu - \tfrac{1}{2}\sigma^2\right)\Delta + \sigma\left(B(t + \Delta) - B(t)\right).$$

For disjoint intervals of length $\Delta$ the log returns are therefore independent normally
distributed random variables with mean $(\mu - \sigma^2/2)\Delta$ and variance $\sigma^2 \Delta$. The normality
is a conclusion which can easily be checked against observed log returns, and it is
found that although the observed values are approximately normally distributed for
intervals $\Delta$ greater than 1 day, the deviations from normality are substantial for shorter
time intervals. This is one of the reasons for developing the more realistic models
described in Section 7.5. The parameter $\sigma^2$ is called the volatility parameter of the
geometric  Brownian  motion  model  and  plays  a  key  role  in  the  celebrated  option  pricing 
results  (see  Section  7.6)  developed  for  this  model  by  Black,  Scholes  and  Merton, 
earning  the  Nobel  Economics  Prize  for  Merton  and  Scholes  in  1997  (unfortunately 
Black  died  before  the  award  was  made).  These  results  inspired  an  explosion  of  interest, 
not  only  in  the  pricing  of  more  complicated  financial  derivatives,  but  also  in  the 
development  of  new  continuous-time  models  which,  like  the  discrete-time  ARCH, 
GARCH  and  stochastic  volatility  models,  better  reflect  the  observed  properties  of 
financial  time  series. 
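The distribution of the log returns under geometric Brownian motion is easy to illustrate by simulation. The following Python sketch draws independent log returns with mean $(\mu - \sigma^2/2)\Delta$ and variance $\sigma^2\Delta$ and rebuilds a price path from them. The parameter values ($\mu = 0.1$, $\sigma = 0.2$, $\Delta = 1/252$) and the function names are illustrative assumptions only and are not taken from any data set in this book.

```python
import math
import random


def simulate_gbm_log_returns(mu, sigma, delta, n, seed=0):
    """Draw n independent log returns X(t + delta) - X(t) under GBM."""
    rng = random.Random(seed)
    drift = (mu - 0.5 * sigma ** 2) * delta
    scale = sigma * math.sqrt(delta)
    return [drift + scale * rng.gauss(0.0, 1.0) for _ in range(n)]


def prices_from_log_returns(p0, log_returns):
    """P at successive grid points: multiply by exp(log return) each step."""
    prices = [p0]
    for r in log_returns:
        prices.append(prices[-1] * math.exp(r))
    return prices


mu, sigma, delta, n = 0.1, 0.2, 1.0 / 252.0, 20000
returns = simulate_gbm_log_returns(mu, sigma, delta, n)
prices = prices_from_log_returns(100.0, returns)

sample_mean = sum(returns) / n
sample_var = sum((r - sample_mean) ** 2 for r in returns) / n
print(sample_mean, (mu - 0.5 * sigma ** 2) * delta)  # sample vs. theoretical mean
print(sample_var, sigma ** 2 * delta)                # sample vs. theoretical variance
print(min(prices) > 0)                               # prices stay positive
```

Since each price is the previous one multiplied by an exponential, the simulated path can never become negative, in contrast to the Bachelier model.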


7.2  GARCH  Models 

For modeling changing volatility as discussed above, Engle (1982) introduced the
ARCH($p$) process $\{Z_t\}$ as a stationary solution of the equations

$$Z_t = \sqrt{h_t}\, e_t, \quad \{e_t\} \sim \mathrm{IID\ N}(0, 1), \tag{7.2.1}$$

where $h_t$ is the (positive) function of $\{Z_s, s < t\}$, defined by

$$h_t = \alpha_0 + \sum_{i=1}^{p} \alpha_i Z_{t-i}^2, \tag{7.2.2}$$

with $\alpha_0 > 0$ and $\alpha_j \ge 0$, $j = 1, \ldots, p$. The name ARCH signifies autoregressive
conditional heteroscedasticity and $h_t$ is the conditional variance of $Z_t$ given $\{Z_s, s < t\}$.

The simplest such process is the ARCH(1) process. In this case the recursions
(7.2.1) and (7.2.2) give

$$\begin{aligned} Z_t^2 &= \alpha_0 e_t^2 + \alpha_1 Z_{t-1}^2 e_t^2 \\ &= \alpha_0 e_t^2 + \alpha_1 \alpha_0 e_t^2 e_{t-1}^2 + \alpha_1^2 Z_{t-2}^2 e_t^2 e_{t-1}^2 \\ &= \cdots \\ &= \alpha_0 \sum_{j=0}^{n} \alpha_1^j e_t^2 e_{t-1}^2 \cdots e_{t-j}^2 + \alpha_1^{n+1} Z_{t-n-1}^2 e_t^2 e_{t-1}^2 \cdots e_{t-n}^2. \end{aligned}$$

If $\alpha_1 < 1$ and $\{Z_t\}$ is stationary and causal (i.e., $Z_t$ is a function of $\{e_s, s \le t\}$), then
the last term has expectation $\alpha_1^{n+1} E Z_t^2$ and converges to zero as $n \to \infty$. The first term
converges as $n \to \infty$ since it is non-decreasing in $n$ and its expected value is bounded
above by $\alpha_0/(1 - \alpha_1)$. Hence

$$Z_t^2 = \alpha_0 \sum_{j=0}^{\infty} \alpha_1^j e_t^2 e_{t-1}^2 \cdots e_{t-j}^2 \tag{7.2.3}$$

and

$$E Z_t^2 = \alpha_0/(1 - \alpha_1). \tag{7.2.4}$$

Since

$$Z_t = e_t \sqrt{\alpha_0 \left(1 + \sum_{j=1}^{\infty} \alpha_1^j e_{t-1}^2 \cdots e_{t-j}^2\right)}, \tag{7.2.5}$$

it is clear that $\{Z_t\}$ is strictly stationary and hence, since $E Z_t^2 < \infty$, also stationary
in the weak sense. We have now established the following result.


Solution of the ARCH(1) Equations:

If $\alpha_1 < 1$, the unique causal stationary solution of the ARCH(1) equations is given
by (7.2.5). It has the properties

$$E(Z_t) = E(E(Z_t \mid e_s, s < t)) = 0,$$

$$\mathrm{Var}(Z_t) = \alpha_0/(1 - \alpha_1),$$

and

$$E(Z_{t+h} Z_t) = E(E(Z_{t+h} Z_t \mid e_s, s < t + h)) = 0 \quad \text{for } h > 0.$$


Thus the ARCH(1) process with $\alpha_1 < 1$ is strictly stationary white noise. However,
it is not an iid sequence, since from (7.2.1) and (7.2.2),

$$E(Z_t^2 \mid Z_{t-1}) = (\alpha_0 + \alpha_1 Z_{t-1}^2)\, E(e_t^2 \mid Z_{t-1}) = \alpha_0 + \alpha_1 Z_{t-1}^2.$$

This also shows that $\{Z_t\}$ is not Gaussian, since strictly stationary Gaussian white noise
is necessarily iid. From (7.2.5) it is clear that the distribution of $Z_t$ is symmetric, i.e.,
that $Z_t$ and $-Z_t$ have the same distribution. From (7.2.3) it is easy to calculate $E(Z_t^4)$
(Problem 7.1) and hence to show that $E(Z_t^4)$ is finite if and only if $3\alpha_1^2 < 1$. More
generally (see Engle 1982), it can be shown that for every $\alpha_1$ in the interval $(0, 1)$,
$E(Z_t^{2k}) = \infty$ for some positive integer $k$. This indicates the “heavy-tailed” nature of
the marginal distribution of $Z_t$. If $E Z_t^4 < \infty$, the squared process $Y_t = Z_t^2$ has the same
ACF as the AR(1) process $W_t = \alpha_1 W_{t-1} + e_t$, a result that extends also to ARCH($p$)
processes (see Problem 7.3).

The ARCH($p$) process is conditionally Gaussian, in the sense that for given values
of $\{Z_s, s = t - 1, t - 2, \ldots, t - p\}$, $Z_t$ is Gaussian with known distribution. This
makes it easy to write down the likelihood of $Z_{p+1}, \ldots, Z_n$ conditional on $\{Z_1, \ldots, Z_p\}$
and hence, by numerical maximization, to compute conditional maximum likelihood
estimates of the parameters. For example, the conditional likelihood of observations
$\{z_2, \ldots, z_n\}$ of the ARCH(1) process given $Z_1 = z_1$ is

$$L = \prod_{t=2}^{n} \frac{1}{\sqrt{2\pi\left(\alpha_0 + \alpha_1 z_{t-1}^2\right)}} \exp\left\{-\frac{z_t^2}{2\left(\alpha_0 + \alpha_1 z_{t-1}^2\right)}\right\}.$$

Figure 7-1 A realization of the process $Z_t = e_t\sqrt{1 + 0.5 Z_{t-1}^2}$

Figure 7-2 The sample autocorrelation function of the series in Figure 7-1


Example 7.2.1 An ARCH(1) Series

Figure 7-1 shows a realization of the ARCH(1) process with $\alpha_0 = 1$ and $\alpha_1 = 0.5$. The
graph  of  the  realization  and  the  sample  autocorrelation  function  shown  in  Figure  7-2 
suggest  that  the  process  is  white  noise.  This  conclusion  is  correct  from  a  second-order 
point  of  view. 

However, the fact that the series is not a realization of iid noise is very strongly
indicated by Figure 7-3, which shows the sample autocorrelation function of the series
$\{Z_t^2\}$. (The sample ACF of $\{|Z_t|\}$ and that of $\{Z_t^2\}$ can be plotted in ITSM by selecting
Statistics>Residual Analysis>ACF abs values/Squares.)

Figure 7-3 The sample autocorrelation function of the squares of the data shown in Figure 7-1

It is instructive to apply the Ljung-Box and McLeod-Li portmanteau tests for
white noise to this series (see Section 1.6). To do this using ITSM, open the file
ARCH.TSM, and then select Statistics>Residual Analysis>Tests of
Randomness. We find (with $h = 20$) that the Ljung-Box test (and all the others
except for the McLeod-Li test) are passed comfortably at level 0.05. However,
the McLeod-Li test gives a $p$-value of 0 to five decimal places, clearly rejecting
the hypothesis that the series is iid.

□ 
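The behaviour seen in Example 7.2.1 can be reproduced outside ITSM. The following Python sketch simulates an ARCH(1) series with $\alpha_0 = 1$ and $\alpha_1 = 0.5$ directly from (7.2.1) and (7.2.2), and compares the lag-one sample autocorrelations of the series and of its squares. The series length, burn-in period, seed and function names are our own assumptions, and the simulated series is not the one stored in ARCH.TSM.

```python
import math
import random


def simulate_arch1(alpha0, alpha1, n, burn_in=500, seed=1):
    """Simulate Z_t = sqrt(h_t) e_t with h_t = alpha0 + alpha1 * Z_{t-1}^2."""
    rng = random.Random(seed)
    z_prev = 0.0
    series = []
    for t in range(n + burn_in):
        h = alpha0 + alpha1 * z_prev ** 2
        z_prev = math.sqrt(h) * rng.gauss(0.0, 1.0)
        if t >= burn_in:  # discard the start-up values
            series.append(z_prev)
    return series


def sample_acf(x, lag):
    """Sample autocorrelation at the given lag (divisor n in both terms)."""
    n = len(x)
    mean = sum(x) / n
    c0 = sum((v - mean) ** 2 for v in x) / n
    ch = sum((x[t] - mean) * (x[t + lag] - mean) for t in range(n - lag)) / n
    return ch / c0


z = simulate_arch1(1.0, 0.5, 50000)
squares = [v * v for v in z]

print(sum(squares) / len(z))   # compare with alpha0 / (1 - alpha1) = 2
print(sample_acf(z, 1))        # near 0: white noise
print(sample_acf(squares, 1))  # clearly positive: not iid
```

Because $3\alpha_1^2 < 1$ here, the squared series has the ACF of an AR(1) process with coefficient $\alpha_1$, although the heavy tails make the sample ACF of the squares converge slowly to that value.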

The GARCH($p, q$) process (see Bollerslev 1986) is a generalization of the
ARCH($p$) process in which the variance equation (7.2.2) is replaced by

$$h_t = \alpha_0 + \sum_{i=1}^{p} \alpha_i Z_{t-i}^2 + \sum_{j=1}^{q} \beta_j h_{t-j}, \tag{7.2.6}$$

with $\alpha_0 > 0$ and $\alpha_j, \beta_j \ge 0$, $j = 1, 2, \ldots$.

In the analysis of empirical financial data such as percentage daily stock returns
(defined as $100 \ln(P_t/P_{t-1})$, where $P_t$ is the closing price on trading day $t$), it is usually
found that better fits to the data are obtained by relaxing the Gaussian assumption in
(7.2.1) and supposing instead that the distribution of $Z_t$ given $\{Z_s, s < t\}$ has a
heavier-tailed zero-mean distribution such as Student’s $t$-distribution. To incorporate such
distributions we can define a general GARCH($p, q$) process as a stationary process
$\{Z_t\}$ satisfying (7.2.6) and the generalized form of (7.2.1),

$$Z_t = \sqrt{h_t}\, e_t, \quad \{e_t\} \sim \mathrm{IID}(0, 1). \tag{7.2.7}$$

For modeling purposes it is usually assumed in addition that either

$$e_t \sim \mathrm{N}(0, 1), \tag{7.2.8}$$

(as in (7.2.1)) or that

$$\sqrt{\frac{\nu}{\nu - 2}}\; e_t \sim t_\nu, \quad \nu > 2, \tag{7.2.9}$$

where $t_\nu$ denotes Student’s $t$-distribution with $\nu$ degrees of freedom. (The scale factor
on the left of (7.2.9) is introduced to make the variance of $e_t$ equal to 1.) Other
distributions for $e_t$ can also be used.

One of the striking features of stock return data that is reflected by GARCH models
is the “persistence of volatility,” or the phenomenon that large (small) fluctuations in
the data tend to be followed by fluctuations of comparable magnitude. GARCH models
reflect this by incorporating correlation in the sequence $\{h_t\}$ of conditional variances.

Figure 7-4 The daily percentage returns of the Dow Jones Industrial Index (E1032.TSM) from July 1, 1997, through April 9, 1999 (above), and the estimates of $\sigma_t = \sqrt{h_t}$ for the conditional Gaussian GARCH(1,1) model of Example 7.2.2

Example 7.2.2 Fitting GARCH Models to Stock Data

The  top  graph  in  Figure  7-4  shows  the  percentage  daily  returns  of  the  Dow  Jones 
Industrial  Index  for  the  period  July  1st,  1997,  through  April  9th,  1999,  contained 
in  the  file  E1032.TSM.  The  graph  suggests  that  there  are  sustained  periods  of  both 
high  volatility  (in  October,  1997,  and  August,  1998)  and  of  low  volatility.  The  sample 
autocorrelation function of this series, like that in Example 7.2.1, has very small values;
however, the sample autocorrelations of the absolute values and squares of the data (like
those  in  Example  7.2.1)  are  significantly  different  from  zero,  indicating  dependence  in 
spite  of  the  lack  of  autocorrelation.  (The  sample  autocorrelations  of  the  absolute  values 
and  squares  of  the  residuals  (or  of  the  data  if  no  transformations  have  been  made  and 
no  model  fitted)  can  be  seen  by  clicking  on  the  third  green  button  at  the  top  of  the 
ITSM  window.)  These  properties  suggest  that  an  ARCH  or  GARCH  model  might  be 
appropriate  for  this  series. 

□ 

The  model 

Yt  =  a  +  Zt ,  (7.2.10) 

where  {Zt}  is  the  GARCH (/?,  q)  process  defined  by  (7.2.6)-(7.2.8),  can  be  fitted  using 
ITSM  as  follows.  Open  the  project  E1032.TSM  and  click  on  the  red  button  labeled 
GAR  at  the  top  of  the  ITSM  screen.  In  the  resulting  dialog  box  enter  the  desired  values 
of  p  and  q ,  e.g.,  1  and  1  if  you  wish  to  fit  a  GARCH(1,1)  model.  You  may  also  enter 
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initial  values  for  the  coefficients  ao, ...  ,ap,  and  /3\,  . . . ,  /3q,  or  alternatively  use  the 
default  values  specified  by  the  program.  Make  sure  that  Use  normal  noise  is 
selected,  click  on  OK  and  then  click  on  the  red  MLE  button.  You  will  be  advised  to 
subtract  the  sample  mean  (unless  you  wish  to  assume  that  the  parameter  a  in  (7.2.10) 
is  zero).  If  you  subtract  the  sample  mean  it  will  be  used  as  the  estimate  of  a  in 
the  model  (7.2.10).  The  GARCH  Maximum  Likelihood  Estimation  box  will 
then open. When you click on OK the optimization will proceed. Denoting by $\{\tilde Z_t\}$
the (possibly) mean-corrected observations, the GARCH coefficients are estimated
by numerically maximizing the likelihood of $\tilde Z_{p+1}, \ldots, \tilde Z_n$ conditional on the known
values $\tilde Z_1, \ldots, \tilde Z_p$, and with assumed values 0 for each $\tilde Z_t$, $t \le 0$, and $\hat\sigma^2$ for each $h_t$,
$t \le 0$, where $\hat\sigma^2$ is the sample variance of $\{\tilde Z_1, \ldots, \tilde Z_n\}$. In other words the program
maximizes

$$L(\alpha_0, \ldots, \alpha_p, \beta_1, \ldots, \beta_q) = \prod_{t=p+1}^{n} \frac{1}{\sigma_t}\,\phi\!\left(\frac{\tilde Z_t}{\sigma_t}\right), \tag{7.2.11}$$

with respect to the coefficients $\alpha_0, \ldots, \alpha_p$ and $\beta_1, \ldots, \beta_q$, where $\phi$ denotes the standard
normal density, and the standard deviations $\sigma_t = \sqrt{h_t}$, $t \ge 1$, are computed
recursively from (7.2.6) with $Z_t$ replaced by $\tilde Z_t$, and with $\tilde Z_t = 0$ and $h_t = \hat\sigma^2$ for
$t \le 0$. To find the minimum of $-2\ln(L)$ it is advisable to repeat the optimization by
clicking on the red MLE button and then on OK several times until the result stabilizes.
It is also useful to try other initial values for $\alpha_0, \ldots, \alpha_p$ and $\beta_1, \ldots, \beta_q$, to minimize
the chance of finding only a local minimum of $-2\ln(L)$. Note that the optimization
is constrained so that the estimated parameters are all non-negative with

$$\alpha_1 + \cdots + \alpha_p + \beta_1 + \cdots + \beta_q < 1, \tag{7.2.12}$$

and $\alpha_0 > 0$. Condition (7.2.12) is necessary and sufficient for the corresponding
GARCH equations to have a causal weakly stationary solution.
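
The conditional likelihood calculation described above can be sketched in a few lines of Python for the Gaussian GARCH(1,1) case. This is only an illustration of the recursion, not the ITSM implementation: the function name `garch11_neg2loglik` is hypothetical, the input is assumed to be already mean-corrected, and the sample variance is computed with divisor $n$ (an assumption, since the divisor is not specified in the text).

```python
import math

def garch11_neg2loglik(z, alpha0, alpha1, beta1):
    """Return -2 ln L for a Gaussian GARCH(1,1), conditional on z[0] (p = 1).

    z is assumed to hold the mean-corrected observations Z_1, ..., Z_n.
    """
    n = len(z)
    mean = sum(z) / n
    # Sample variance of the observations (divisor n is an assumption).
    sigma2_hat = sum((x - mean) ** 2 for x in z) / n
    # Pre-sample values: Z_0 = 0 and h_0 = sample variance.
    z_prev, h_prev = 0.0, sigma2_hat
    total = 0.0
    for t, z_t in enumerate(z, start=1):
        # GARCH(1,1) recursion: h_t = alpha0 + alpha1 * Z_{t-1}^2 + beta1 * h_{t-1}
        h_t = alpha0 + alpha1 * z_prev ** 2 + beta1 * h_prev
        if t >= 2:  # the product runs over t = p + 1, ..., n with p = 1
            # -2 ln[(1/sigma_t) * phi(z_t / sigma_t)] = ln(2 pi) + ln h_t + z_t^2 / h_t
            total += math.log(2 * math.pi) + math.log(h_t) + z_t ** 2 / h_t
        z_prev, h_prev = z_t, h_t
    return total
```

A numerical optimizer would then minimize this function over $\alpha_0 > 0$, $\alpha_1 \ge 0$, $\beta_1 \ge 0$ subject to $\alpha_1 + \beta_1 < 1$, restarting from several initial values as recommended above.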

Comparison  of  models  with  different  orders  p  and  q  can  be  made  with  the  aid  of 
the  AICC,  which  is  defined  in  terms  of  the  conditional  likelihood  L  as 

$$\mathrm{AICC} := -2\,\frac{n}{n-p}\,\ln L + 2(p+q+2)\,n/(n-p-q-3). \tag{7.2.13}$$

The factor $n/(n-p)$ multiplying the first term on the right has been introduced to
correct for the fact that the number of factors in (7.2.11) is only $n-p$. Notice also that
the GARCH($p$, $q$) model has $p+q+1$ coefficients.
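
As a small worked illustration of (7.2.13), the following sketch computes the AICC from a given value of $-2\ln L$. The function name `garch_aicc` and its argument names are hypothetical.

```python
def garch_aicc(neg2_log_lik, n, p, q):
    """AICC of (7.2.13) for a GARCH(p, q) fit with Gaussian noise.

    neg2_log_lik is -2 ln L, where L is the conditional likelihood (7.2.11).
    """
    # The factor n / (n - p) rescales the n - p likelihood factors to n observations.
    return (n / (n - p)) * neg2_log_lik + 2 * (p + q + 2) * n / (n - p - q - 3)
```

For a model with conditional $t$-distribution the same function could be called with `q + 1` in place of `q`, in line with the adjustment described later in this section.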

The estimated mean is $\hat a = 0.0608$ and the minimum-AICC GARCH model (with
Gaussian noise) for the residuals, $\tilde Z_t = Y_t - \hat a$, is found to be the GARCH(1,1) with
estimated parameter values

$$\hat\alpha_0 = 0.1300, \quad \hat\alpha_1 = 0.1266, \quad \hat\beta_1 = 0.7922,$$

and an AICC value [defined by (7.2.13)] of 1469.02. The bottom graph in Figure 7-4
shows the corresponding estimated conditional standard deviations, $\hat\sigma_t$, which clearly
reflect the changing volatility of the series $\{Y_t\}$. This graph is obtained from ITSM
by clicking on the red SV (stochastic volatility) button. Under the model defined by
(7.2.6)-(7.2.8) and (7.2.10), the GARCH residuals, $\{\tilde Z_t/\hat\sigma_t\}$, should be approximately
IID N(0,1). A check on the independence is provided by the sample ACFs of the
absolute values and squares of the residuals, which are obtained by clicking on
the fifth red button at the top of the ITSM window. These are found to be not
significantly different from zero. To check for normality, select Garch>Garch
residuals>QQ-Plot(normal). If the model is appropriate the resulting graph
should approximate a straight line through the origin with slope 1. It is found that the
deviations from the expected line are quite large for large values of $\tilde Z_t$, suggesting the
need for a heavier-tailed model, e.g., a model with conditional $t$-distribution as defined
by (7.2.9).

To fit the GARCH model defined by (7.2.6), (7.2.7), (7.2.9) and (7.2.10) (i.e.,
with conditional $t$-distribution), we proceed in the same way, but with the conditional
likelihood replaced by

$$L(\alpha_0, \ldots, \alpha_p, \beta_1, \ldots, \beta_q, \nu) = \prod_{t=p+1}^{n} \frac{\sqrt{\nu}}{\sigma_t\sqrt{\nu-2}}\; t_\nu\!\left(\frac{\tilde Z_t\sqrt{\nu}}{\sigma_t\sqrt{\nu-2}}\right). \tag{7.2.14}$$

Maximization is now carried out with respect to the coefficients, $\alpha_0, \ldots, \alpha_p$, $\beta_1, \ldots, \beta_q$
and the degrees of freedom $\nu$ of the $t$-density, $t_\nu$. The optimization can be performed
using ITSM in exactly the same way as described for the GARCH model with Gaussian
noise, except that the option Use t-distribution for noise should be
checked in each of the dialog boxes where it appears. In order to locate the minimum
of $-2\ln(L)$ it is often useful to initialize the coefficients of the model by first fitting
a GARCH model with Gaussian noise and then carrying out the optimization using
$t$-distributed noise.

The estimated mean is $\hat a = 0.0608$ as before and the minimum-AICC GARCH
model for the residuals, $\tilde Z_t = Y_t - \hat a$, is the GARCH(1,1) with estimated parameter
values

$$\hat\alpha_0 = 0.1324, \quad \hat\alpha_1 = 0.0672, \quad \hat\beta_1 = 0.8400, \quad \hat\nu = 5.714,$$

and an AICC value (as in (7.2.13) with $q$ replaced by $q+1$) of 1437.89. Thus from
the point of view of AICC, the model with conditional $t$-distribution is substantially
better than the conditional Gaussian model. The sample ACFs of the absolute values
and squares of the GARCH residuals are much the same as those found using Gaussian
noise, but the QQ plot (obtained by clicking on the red QQ button and based on the
$t$-distribution with 5.714 degrees of freedom) is closer to the expected line than was the
case for the model with Gaussian noise.

There  are  many  important  and  interesting  theoretical  questions  associated  with 
the  existence  and  properties  of  stationary  solutions  of  the  GARCH  equations  and  their 
moments  and  of  the  sampling  properties  of  these  processes.  As  indicated  above,  in 
maximizing  the  conditional  likelihood,  ITSM  constrains  the  GARCH  coefficients  to 
be non-negative and to satisfy the condition (7.2.12) with $\alpha_0 > 0$. These conditions
are sufficient for the process defined by the GARCH equations to be stationary. It is
frequently found in practice that the estimated values of $\alpha_1, \ldots, \alpha_p$ and $\beta_1, \ldots, \beta_q$
have a sum which is very close to 1. A GARCH($p$, $q$) model with
$\alpha_1 + \cdots + \alpha_p + \beta_1 + \cdots + \beta_q = 1$ is called I-GARCH (or integrated GARCH). Many generalizations
of GARCH processes (ARCH-M, E-GARCH, I-GARCH, T-GARCH, FI-GARCH,
etc., as well as ARMA models driven by GARCH noise, and regression models with
GARCH errors) can now be found in the econometrics literature; see Andersen et al.
(2009).

ITSM  can  be  used  to  fit  ARMA  and  regression  models  with  GARCH  noise  by 
using  the  procedures  described  in  Example  7.2.2  to  fit  a  GARCH  model  to  the  residuals 
$\{Z_t\}$ from the ARMA (or regression) fit.


Example  7.2.3  Fitting  ARMA  Models  Driven  by  GARCH  Noise 


If we open the data file SUNSPOTS.TSM, subtract the mean and use the option
Model>Estimation>Autofit with the default ranges for $p$ and $q$, we obtain an
ARMA(3,4) model for the mean-corrected data. Clicking on the second green button
at the top of the ITSM window, we see that the sample ACF of the ARMA residuals
is compatible with iid noise. However the sample autocorrelation functions of the
absolute values and squares of the residuals (obtained by clicking on the third green
button) indicate that they are not independent. To fit a Gaussian GARCH(1,1) model
to the ARMA residuals click on the red GAR button, enter the value 1 for both $p$ and
$q$ and click OK. Then click on the red MLE button, click OK in the dialog box, and
the GARCH ML Estimates window will open, showing the estimated parameter
values. Repeat the steps in the previous sentence two more times and the window will
display the following ARMA(3,4) model for the mean-corrected sunspot data and the
fitted GARCH model for the ARMA noise process $\{Z_t\}$,

$$X_t = 2.463X_{t-1} - 2.248X_{t-2} + 0.757X_{t-3} + Z_t - 0.948Z_{t-1} - 0.296Z_{t-2} + 0.313Z_{t-3} + 0.136Z_{t-4},$$

where

$$Z_t = \sqrt{h_t}\,e_t$$

and

$$h_t = 31.152 + 0.223Z_{t-1}^2 + 0.596h_{t-1}.$$

The AICC value for the GARCH fit (805.12) should be used for comparing alternative
GARCH models for the ARMA residuals. The AICC value adjusted for the ARMA
fit (821.70) should be used for comparison with alternative ARMA models (with
or without GARCH noise). Standard errors of the estimated coefficients are also
displayed.

Simulation using the fitted ARMA(3,4) model with GARCH(1,1) noise can
be carried out by selecting Garch>Simulate Garch process. If you retain
the settings in the ARMA Simulation dialog box and click OK you will see a simulated
realization of the model for the original data in SUNSPOTS.TSM.

Some  useful  references  for  extensions  and  further  properties  of  GARCH  models  are 
Weiss  (1986),  Engle  (1995),  Shephard  (1996),  Gourieroux  (1997),  Lindner  (2009)  and 
Francq  and  Zakoian  (2010). 


7.3  Modified  GARCH  Processes 

The  following  are  so-called  “stylized  features”  associated  with  observed  time  series 
of  financial  returns: 

(i)  the  marginal  distributions  have  heavy  tails, 

(ii)  there  is  persistence  of  volatility, 

(iii)  the  returns  exhibit  aggregational  Gaussianity, 

(iv)  there  is  asymmetry  with  respect  to  negative  and  positive  disturbances  and 

(v)  the  volatility  frequently  exhibits  long-range  dependence. 


The properties (i), (ii) and (iii) are well accounted for by the GARCH models of
Section 7.2. Property (iii) means that the sum, $S_n = \sum_{t=1}^{n} Z_t$, of the daily returns,
$Z_t = \ln P_t - \ln P_{t-1}$, is approximately normally distributed if $n$ is large. For the GARCH
model with $EZ_t^2 = \sigma^2 < \infty$ it follows from the martingale central limit theorem (see
e.g. Billingsley (1995)) that $n^{-1/2}(\ln P_n - \ln P_0) = n^{-1/2}\sum_{t=1}^{n} Z_t$ is asymptotically
N$(0, \sigma^2)$, in accordance with (iii).

To  account  for  properties  (iv)  and  (v)  the  EGARCH  and  FIGARCH  models  were 
devised. 


7.3.1  EGARCH  Models 

To allow negative and positive values of $e_t$ in the definition of the GARCH process
to have different impacts on the subsequent volatilities, $h_s$ ($s > t$), Nelson (1991)
introduced EGARCH models, illustrated in the following simple example.

EGARCH(1,1) 


Consider the process $\{Z_t\}$ defined by the equations,

$$Z_t = \sqrt{h_t}\,e_t, \quad \{e_t\} \sim \mathrm{IID}(0, 1), \tag{7.3.1}$$

where $\{\ell_t := \ln h_t\}$ is the weakly and strictly stationary solution of

$$\ell_t = c + \alpha_1 g(e_{t-1}) + \gamma_1 \ell_{t-1}, \tag{7.3.2}$$

$c \in \mathbb{R}$, $|\gamma_1| < 1$,

$$g(e_t) = e_t + \lambda(|e_t| - E|e_t|), \tag{7.3.3}$$

and $e_t$ has a distribution symmetric about zero, i.e., $e_t \overset{d}{=} -e_t$.

The process is defined in terms of $\ell_t$ to ensure that $h_t\,(= e^{\ell_t}) > 0$. Equation (7.3.3)
can be rewritten as

$$g(e_t) = \begin{cases} (1+\lambda)e_t - \lambda E|e_t| & \text{if } e_t \ge 0, \\ (1-\lambda)e_t - \lambda E|e_t| & \text{if } e_t < 0, \end{cases}$$

showing that the function $g$ is piecewise linear with slope $(1+\lambda)$ on $(0, \infty)$ and slope
$(1-\lambda)$ on $(-\infty, 0)$. This asymmetry in $g$ allows $\ell_t$ to respond differently to positive
and negative shocks $e_{t-1}$ of the same magnitude. If $\lambda = 0$ there is no asymmetry.

When fitting EGARCH models to stock prices it is usually found that the estimated
value of $\lambda$ is negative, corresponding to large negative shocks having greater impact
on volatility than positive ones of the same magnitude.


Properties of $\{g(e_t)\}$:

(i) $\{g(e_t)\}$ is iid.

(ii) $Eg(e_t) = 0$.

(iii) $\mathrm{Var}(g(e_t)) = 1 + \lambda^2\,\mathrm{Var}(|e_t|)$.

(The symmetry of $e_t$ implies that $e_t$ and $|e_t| - E|e_t|$ are uncorrelated.)


□ 
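
The asymmetric response function $g$ of (7.3.3) and the log-volatility recursion (7.3.2) can be illustrated with the following sketch. It assumes, purely for illustration, that $e_t$ is standard normal, so that $E|e_t| = \sqrt{2/\pi}$; the function names and the starting value `l0` are hypothetical.

```python
import math

# E|e_t| when e_t ~ N(0, 1); this distributional choice is an assumption.
E_ABS_NORMAL = math.sqrt(2.0 / math.pi)

def g(e, lam, e_abs=E_ABS_NORMAL):
    """g(e) = e + lam * (|e| - E|e|), as in (7.3.3)."""
    return e + lam * (abs(e) - e_abs)

def egarch11_log_vol(shocks, c, alpha1, gamma1, lam, l0=0.0):
    """Iterate l_t = c + alpha1 * g(e_{t-1}) + gamma1 * l_{t-1} from l0.

    shocks holds e_0, e_1, ...; the returned list holds l_0, l_1, ...
    """
    ls = [l0]
    for e_prev in shocks:
        ls.append(c + alpha1 * g(e_prev, lam) + gamma1 * ls[-1])
    return ls
```

With `lam = -0.5`, for example, `g(-1.0, -0.5)` is larger in absolute value than `g(1.0, -0.5)`, reflecting the slopes $1-\lambda$ and $1+\lambda$ on the negative and positive half-lines.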


More generally, the EGARCH($p$, $q$) process is obtained by replacing the equation
(7.3.2) for $\ell_t := \ln h_t$ by

$$\ell_t = c + \alpha(B)g(e_t) + \gamma(B)\ell_t, \tag{7.3.4}$$

where

$$\alpha(B) = \sum_{i=1}^{p}\alpha_i B^i, \quad \gamma(B) = \sum_{i=1}^{q}\gamma_i B^i.$$

Clearly $\{\ell_t\}$, $\{h_t\}$ and $\{Z_t\}$ are all strictly stationary and causal if $1 - \gamma(z)$ is non-zero
for all complex $z$ such that $|z| \le 1$.

Nelson also proposed the use of the generalized error distribution (GED) for $e_t$,
with density

$$f(x) = \frac{\nu\exp[(-1/2)|x/\xi|^{\nu}]}{\xi\cdot 2^{1+1/\nu}\,\Gamma(1/\nu)},$$

where

$$\xi = \left[\frac{2^{-2/\nu}\,\Gamma(1/\nu)}{\Gamma(3/\nu)}\right]^{1/2}$$

and $\nu > 0$. The value of $\xi$ ensures that $\mathrm{Var}(e_t) = 1$ and the parameter $\nu$ determines
the tail heaviness. For $\nu = 2$, $e_t \sim$ N(0, 1). Tail heaviness increases as $\nu$ decreases.

Properties of the GED:

(i) $f$ is symmetric and $\tfrac{1}{2}|e_t/\xi|^{\nu}$ has the gamma distribution
with parameters $1/\nu$ and 1 (see Appendix A.1, Example (d)).

(ii) The specified value of $\xi$ ensures that $\mathrm{Var}(e_t) = 1$.

(iii) $$E|e_t|^k = \frac{\Gamma((k+1)/\nu)}{\Gamma(1/\nu)}\left[\frac{\Gamma(1/\nu)}{\Gamma(3/\nu)}\right]^{k/2}.$$
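
The GED formulas above can be checked numerically with a short sketch. The helper names (`ged_xi`, `ged_pdf`, `ged_abs_moment`, `trapezoid`) are hypothetical, and the crude trapezoidal rule is used only to confirm that the density integrates to 1 and has unit variance.

```python
import math

def ged_xi(nu):
    """Scale constant xi that gives the GED unit variance."""
    return math.sqrt(2.0 ** (-2.0 / nu) * math.gamma(1.0 / nu) / math.gamma(3.0 / nu))

def ged_pdf(x, nu):
    """GED density f(x) with shape parameter nu > 0."""
    xi = ged_xi(nu)
    norm = xi * 2.0 ** (1.0 + 1.0 / nu) * math.gamma(1.0 / nu)
    return nu * math.exp(-0.5 * abs(x / xi) ** nu) / norm

def ged_abs_moment(k, nu):
    """E|e_t|^k from property (iii)."""
    ratio = math.gamma(1.0 / nu) / math.gamma(3.0 / nu)
    return math.gamma((k + 1.0) / nu) / math.gamma(1.0 / nu) * ratio ** (k / 2.0)

def trapezoid(f, a, b, steps=20000):
    """Composite trapezoidal rule on [a, b]."""
    h = (b - a) / steps
    inner = sum(f(a + i * h) for i in range(1, steps))
    return h * (0.5 * f(a) + inner + 0.5 * f(b))
```

For $\nu = 2$ the density reduces to the standard normal density, so `ged_pdf(0.0, 2.0)` agrees with $1/\sqrt{2\pi}$, and `ged_abs_moment(2, nu)` equals 1 for every $\nu$, consistent with property (ii).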


Inference  via  Conditional  Maximum  Likelihood 

As in Section 7.2 we initialize the recursions (7.3.1) and (7.3.4) by supposing that

(i) $h_t = \hat\sigma^2$, $t \le 0$.

(ii) $e_t = 0$, $t \le 0$.

Then $h_1$, $e_1\,(= Z_1/\sqrt{h_1})$, $h_2$, $e_2$, $\ldots$, can be computed recursively from the
observations $Z_1, Z_2, \ldots$, and the recursions defining the process.

The conditional likelihood is then computed as

$$L = \prod_{t=1}^{n}\frac{1}{\sqrt{h_t}}\,f\!\left(\frac{Z_t}{\sqrt{h_t}}\right).$$

We therefore need to minimize

$$-2\ln L = \sum_{t=1}^{n}\ln h_t + \sum_{t=1}^{n}\left|\frac{Z_t}{\xi\sqrt{h_t}}\right|^{\nu} + 2n\ln\!\left(\frac{2\xi}{\nu}\cdot 2^{1/\nu}\,\Gamma(1/\nu)\right)$$

with respect to

$$c,\ \lambda,\ \nu,\ \alpha_1, \ldots, \alpha_p,\ \gamma_1, \ldots, \gamma_q.$$

Since $h_t$ is automatically positive, the only constraints in this optimization are the
conditions

$$\nu > 0$$

and

$$1 - \gamma(z) \ne 0 \text{ for all complex } z \text{ such that } |z| \le 1.$$
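
A minimal sketch of the objective function for the EGARCH(1,1) case with GED noise is given below. It follows the initialization and recursions described above, but it is an illustration under stated assumptions rather than a reference implementation: the function name is hypothetical, the pre-sample volatility is taken to be the sample variance with divisor $n$, and $E|e_t|$ is computed from the GED moment formula with $k = 1$.

```python
import math

def egarch11_ged_neg2loglik(z, c, alpha1, gamma1, lam, nu):
    """Return -2 ln L for an EGARCH(1,1) with GED(nu) noise.

    z holds the observations Z_1, ..., Z_n. Pre-sample values: h_0 equal to
    the sample variance (divisor n, an assumption) and e_0 = 0.
    """
    n = len(z)
    ratio = math.gamma(1.0 / nu) / math.gamma(3.0 / nu)
    xi = math.sqrt(2.0 ** (-2.0 / nu) * ratio)
    # E|e_t| under the GED (moment formula with k = 1).
    e_abs = math.gamma(2.0 / nu) / math.gamma(1.0 / nu) * math.sqrt(ratio)
    mean = sum(z) / n
    sigma2_hat = sum((x - mean) ** 2 for x in z) / n
    l_prev, e_prev = math.log(sigma2_hat), 0.0
    # Constant term 2n ln((2 xi / nu) * 2^(1/nu) * Gamma(1/nu)).
    total = 2 * n * math.log((2 * xi / nu) * 2.0 ** (1.0 / nu) * math.gamma(1.0 / nu))
    for z_t in z:
        g_prev = e_prev + lam * (abs(e_prev) - e_abs)
        l_t = c + alpha1 * g_prev + gamma1 * l_prev   # l_t = ln h_t
        e_t = z_t / math.sqrt(math.exp(l_t))          # e_t = Z_t / sqrt(h_t)
        total += l_t + abs(e_t / xi) ** nu
        l_prev, e_prev = l_t, e_t
    return total
```

As a sanity check, with $\nu = 2$ and $c = \alpha_1 = \gamma_1 = 0$ we have $h_t = 1$ for all $t \ge 1$ and the function returns $n\ln(2\pi) + \sum_t Z_t^2$, the Gaussian value of $-2\ln L$.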


7.3.2 FIGARCH and IGARCH Models

To  allow  for  the  very  slow  decay  of  the  sample  ACF  frequently  observed  in  long  daily 
squared  return  series,  the  FIGARCH  (fractionally  integrated  GARCH)  models  were 
developed.  Before  introducing  them  we  first  give  a  very  brief  account  of  fractionally 
integrated  ARMA  processes.  (For  more  details  see  Section  11.4  and  Brockwell  and 
Davis  (1991),  Section  13.2.) 


Fractionally  Integrated  ARMA  Processes  and  “Long  Memory” 


The autocorrelation function $\rho(\cdot)$ of an ARMA process at lag $h$ converges rapidly
to zero as $h \to \infty$ in the sense that there exists $r > 1$ such that

$$r^h \rho(h) \to 0, \quad \text{as } h \to \infty.$$


The fractionally integrated ARMA (or ARFIMA) process of order ($p$, $d$, $q$), where
$p$ and $q$ are non-negative integers and $0 < d < 0.5$, is a stationary time series with an
autocorrelation function which for large lags decays at a much slower rate. It is defined
to be the zero-mean stationary solution $\{X_t\}$ of the difference equations

$$(1-B)^d\phi(B)X_t = \theta(B)Z_t, \tag{7.3.5}$$

where $\phi(z)$ and $\theta(z)$ are polynomials of degrees $p$ and $q$ respectively, with no common
zeroes, satisfying

$$\phi(z) \ne 0 \text{ and } \theta(z) \ne 0 \text{ for all complex } z \text{ such that } |z| \le 1,$$

$\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$, $B$ is the backward shift operator, and $(1-B)^r$ is defined via the
power series expansion,

$$(1-z)^r := 1 + \sum_{j=1}^{\infty}\frac{r(r-1)\cdots(r-j+1)}{j!}(-z)^j, \quad |z| < 1,\ r \in \mathbb{R}.$$

The zero-mean stationary process $\{X_t\}$ defined by (7.3.5) has the mean-square convergent
MA($\infty$) representation,

$$X_t = \sum_{j=0}^{\infty}\psi_j Z_{t-j},$$

where $\psi_j$ is the coefficient of $z^j$ in the power series expansion,

$$\psi(z) = (1-z)^{-d}\theta(z)/\phi(z), \quad |z| < 1.$$


The autocorrelations $\rho(j)$ of $\{X_t\}$ at lag $j$ and the coefficients $\psi_j$ both converge to zero
at hyperbolic rates as $j \to \infty$; specifically, there exist non-zero constants $\gamma$ and $\delta$ such
that

$$j^{1-d}\psi_j \to \gamma \quad \text{and} \quad j^{1-2d}\rho(j) \to \delta.$$

Thus $\psi_j$ and $\rho(j)$ converge to zero as $j \to \infty$ at much slower rates than the
corresponding coefficients and autocorrelations of an ARMA process. Consequently
fractionally integrated ARMA processes are said to have “long memory”. The spectral
density of $\{X_t\}$ is given by

$$f(\lambda) = \frac{\sigma^2}{2\pi}\,\frac{|\theta(e^{-i\lambda})|^2}{|\phi(e^{-i\lambda})|^2}\,|1 - e^{-i\lambda}|^{-2d}.$$
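
The hyperbolic decay of the coefficients $\psi_j$ can be seen numerically in the simplest case $\phi(z) = \theta(z) = 1$, where $\psi(z) = (1-z)^{-d}$. The sketch below (the function name is hypothetical) generates the coefficients from the recursion $\psi_j = \psi_{j-1}(j-1+d)/j$, which follows from the power series expansion with $r = -d$.

```python
import math

def frac_ma_weights(d, n):
    """Coefficients psi_0, ..., psi_n of (1 - z)^(-d).

    Uses psi_0 = 1 and psi_j = psi_{j-1} * (j - 1 + d) / j.
    """
    psi = [1.0]
    for j in range(1, n + 1):
        psi.append(psi[-1] * (j - 1 + d) / j)
    return psi

# For d = 0.3, j^(1 - d) * psi_j settles near 1 / Gamma(d), the constant
# gamma of the text in this special case.
weights = frac_ma_weights(0.3, 5000)
limit_estimate = 5000 ** 0.7 * weights[5000]
```

In contrast, the corresponding coefficients of a causal AR(1) process are $\phi^j$, which decay geometrically.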


The exact Gaussian likelihood $L$ of observations $\mathbf{x}_n = (x_1, \ldots, x_n)'$ of a fractionally
integrated ARMA process is given by

$$-2\ln(L) = n\ln(2\pi) + \ln\det\Gamma_n + \mathbf{x}_n'\Gamma_n^{-1}\mathbf{x}_n,$$

where $\Gamma_n = E(\mathbf{X}_n\mathbf{X}_n')$. Calculation and maximization with respect to the parameters
$d$, $\phi_1, \ldots, \phi_p$, $\theta_1, \ldots, \theta_q$ and $\sigma^2$ is difficult. It is much easier to maximize the Whittle
approximation $L_W$ (see (11.4.10)), i.e. to minimize

$$-2\ln(L_W) = n\ln(2\pi) + \sum_j \ln(2\pi f(\omega_j)) + \sum_j \frac{I_n(\omega_j)}{2\pi f(\omega_j)},$$

where $I_n$ is the periodogram, and $\sum_j$ denotes the sum over all nonzero Fourier
frequencies, $\omega_j = 2\pi j/n \in (-\pi, \pi]$. The program ITSM allows estimation of
parameters for ARIMA($p$, $d$, $q$) models either by minimizing $-2\ln(L_W)$, or by the
slower and more computationally intensive process of minimizing $-2\ln(L)$.


Fractionally  Integrated  GARCH  Processes 

In order to incorporate long memory into the family of GARCH models, Baillie
et al. (1996) defined a fractionally integrated GARCH (FIGARCH) process as a causal
strictly stationary solution of the difference equations (7.3.9) and (7.3.10) specified
below.

To motivate the definition, we recall that the GARCH($p$, $q$) process is the causal
stationary solution of the equations,

$$Z_t = \sqrt{h_t}\,e_t, \quad h_t = \alpha_0 + \sum_{i=1}^{p}\alpha_i Z_{t-i}^2 + \sum_{i=1}^{q}\beta_i h_{t-i}, \tag{7.3.6}$$

where $\alpha_0 > 0$, $\alpha_1, \ldots, \alpha_p \ge 0$ and $\beta_1, \ldots, \beta_q \ge 0$. It follows (Problem 7.5) that

$$(1 - \alpha(B) - \beta(B))Z_t^2 = \alpha_0 + (1 - \beta(B))W_t, \tag{7.3.7}$$

where $\{W_t := Z_t^2 - h_t\}$ is white noise, $\alpha(B) = \sum_{i=1}^{p}\alpha_i B^i$ and $\beta(B) = \sum_{i=1}^{q}\beta_i B^i$. There
is a causal weakly stationary solution for $\{Z_t\}$ if and only if the zeroes of $1 - \alpha(z) - \beta(z)$
have absolute value greater than 1 and there is then exactly one such solution
(Bollerslev 1986).


In order to define the IGARCH($p$, $q$) (integrated GARCH($p$, $q$)) process, Engle
and Bollerslev (1986) supposed that the polynomial $(1 - \alpha(z) - \beta(z))$ has a simple
zero at $z = 1$, and that the other zeroes all fall outside the closed unit disc as in (7.3.6).
Under these assumptions we can write

$$(1 - \beta(z) - \alpha(z)) = (1 - z)\phi(z),$$

where $\phi(z)$ is a polynomial with all of its zeroes outside the unit circle. We then say
[cf. (7.3.6)] that $\{Z_t\}$ is an IGARCH($p$, $q$) process if it satisfies

$$\phi(B)(1 - B)Z_t^2 = \alpha_0 + (1 - \beta(B))W_t, \tag{7.3.8}$$

with $Z_t = \sqrt{h_t}\,e_t$, $W_t = Z_t^2 - h_t$ and $\{e_t\} \sim \mathrm{IID}(0, 1)$. Bougerol and Picard (1992)
showed that if the distribution of $e_t$ has unbounded support and no atom at zero then
there is a unique strictly stationary causal solution of these equations for $\{Z_t\}$. The
solution has the property that $EZ_t^2 = \infty$. In practice, for GARCH models fitted to
empirical data, it is often found that $\alpha(1) + \beta(1) \approx 1$, supporting the practical relevance
of the IGARCH model even though $EZ_t^2 = \infty$.

Baillie et al. (1996) defined the FIGARCH($p$, $d$, $q$) process $\{Z_t\}$ to be a causal
strictly stationary solution of the equations,

$$Z_t = \sqrt{h_t}\,e_t, \tag{7.3.9}$$

and [cf. (7.3.8)]

$$\phi(B)(1 - B)^d Z_t^2 = \alpha_0 + (1 - \beta(B))W_t, \quad 0 < d < 1, \tag{7.3.10}$$

where $W_t = Z_t^2 - h_t$, $\{e_t\} \sim \mathrm{IID}(0, 1)$ and the polynomials $\phi(z)$ and $1 - \beta(z)$ are
non-zero for all complex $z$ such that $|z| \le 1$. Substituting $W_t = Z_t^2 - h_t$ in (7.3.10) we
see that (7.3.10) is equivalent to the equation,

$$h_t = \frac{\alpha_0}{1 - \beta(1)} + \left[1 - (1 - \beta(B))^{-1}\phi(B)(1 - B)^d\right]Z_t^2, \tag{7.3.11}$$

which means that the FIGARCH($p$, $d$, $q$) process can be regarded as a special case of the
IARCH($\infty$) process defined by (7.3.9) and

$$h_t = a_0 + \sum_{j=1}^{\infty}a_j Z_{t-j}^2, \tag{7.3.12}$$

with $a_0 > 0$ and $\sum_{j=1}^{\infty}a_j = 1$. The questions of existence and uniqueness of causal
strictly stationary solutions of the IARCH($\infty$) (including FIGARCH) equations have
not yet been fully resolved. Any strictly stationary solution must have infinite variance
since if $\sigma^2 := EZ_t^2 = Eh_t < \infty$ then, since $\sum_{j=1}^{\infty}a_j = 1$, it follows from (7.3.12)
that $\sigma^2 = a_0 + \sigma^2$, contradicting the finiteness of $\sigma^2$. Sufficient conditions for the
existence of a causal strictly stationary solution of the IARCH($\infty$) equations, and in particular of
the FIGARCH equations, have been given by Douc et al. (2008).
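
In the simplest case $\phi(z) = 1$ and $\beta(z) = 0$, the operator multiplying $Z_t^2$ in (7.3.11) is $1 - (1-B)^d$, so the IARCH($\infty$) weights $a_j$ are minus the coefficients of $z^j$ in $(1-z)^d$. The following sketch (with a hypothetical function name) computes them and illustrates that they are positive and sum to 1 only in the limit.

```python
def figarch0d0_weights(d, n):
    """Weights a_1, ..., a_n in 1 - (1 - z)^d = sum_{j >= 1} a_j z^j, 0 < d < 1.

    Uses pi_0 = 1 and pi_j = pi_{j-1} * (j - 1 - d) / j for the coefficients
    of (1 - z)^d, so that a_j = -pi_j.
    """
    weights = []
    pi_j = 1.0
    for j in range(1, n + 1):
        pi_j *= (j - 1 - d) / j
        weights.append(-pi_j)
    return weights

a = figarch0d0_weights(0.4, 2000)
partial_sum = sum(a)  # below 1, approaching 1 slowly as n grows
```

The slow approach of the partial sums to 1 reflects the hyperbolic decay of the weights, which is the source of the long memory in the volatility.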

Other  models,  based  on  changing  volatility  levels,  have  been  proposed  to  explain 
the “long-memory” effect in stock and exchange rate returns. Fractionally integrated
E-GARCH models have also been introduced (Bollerslev and Mikkelsen 1996) in order
to account for both long memory and asymmetry of the effects of positive and negative
shocks $e_t$.


7.4  Stochastic  Volatility  Models 

The general discrete-time stochastic volatility (SV) model for the log return sequence
$\{Z_t\}$ defined in Section 7.1 is [cf. (7.2.1)]

$$Z_t = \sqrt{h_t}\,e_t, \quad t \in \mathbb{Z}, \tag{7.4.1}$$

where $\{e_t\} \sim \mathrm{IID}(0, 1)$, $\{h_t\}$ is a strictly stationary sequence of non-negative random
variables, independent of $\{e_t\}$, and $h_t$ is known, like the corresponding quantity in the
GARCH models, as the volatility at time $t$. Note however that in the GARCH models,
the sequences $\{h_t\}$ and $\{e_t\}$ are not independent since $h_t$ depends on $e_s$, $s < t$, through
the defining equation (7.2.6).

The independence of $\{h_t\}$ and $\{e_t\}$ in the SV model (7.4.1) allows us to model the
volatility process with any non-negative strictly stationary sequence we may wish to
choose. This contrasts with the GARCH models in which the processes $\{Z_t\}$ and $\{h_t\}$
are inextricably linked. Inference for the GARCH models, based on observations of
$Z_1, \ldots, Z_n$, can be carried out using the conditional likelihood, which is easily written
down, as in (7.2.14), in terms of the marginal probability density of the sequence $\{e_t\}$.
Inference for an SV model based on observations of $\{Z_t\}$ however is considerably more
difficult since the process is driven by two independent random sequences rather than
one and only $\{Z_t\}$ is observed. The unobserved sequence $\{h_t\}$ is said to be latent.

A  general  account  of  the  probabilistic  properties  of  SV  models  can  be  found  in 
Davis  and  Mikosch  (2009)  and  an  extensive  history  and  overview  of  both  discrete-time 
and  continuous-time  SV  models  in  Shephard  and  Andersen  (2009).  In  this  section  we 
shall  focus  attention  on  an  early,  but  still  widely  used,  special  case  of  the  SV  model 
due  to  Taylor  (1982,  1986)  known  as  the  lognormal  SV  model. 

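The lognormal SV process $\{Z_t\}$ is defined as,

$$Z_t = \sqrt{h_t}\,e_t, \quad \{e_t\} \sim \mathrm{IID\ N}(0, 1), \tag{7.4.2}$$

where $h_t = e^{\ell_t}$, $\{\ell_t\}$ is a (strictly and weakly) stationary solution of the equations

$$\ell_t = \gamma_0 + \gamma_1\ell_{t-1} + \eta_t, \quad \{\eta_t\} \sim \mathrm{IID\ N}(0, \sigma^2), \tag{7.4.3}$$

$|\gamma_1| < 1$ and the sequences $\{e_t\}$ and $\{\eta_t\}$ are independent. The sequence $\{\ell_t\}$ is clearly
a Gaussian AR(1) process with mean

$$\mu_\ell := E\ell_t = \frac{\gamma_0}{1 - \gamma_1} \tag{7.4.4}$$

and variance

$$v_\ell := \mathrm{Var}(\ell_t) = \frac{\sigma^2}{1 - \gamma_1^2}. \tag{7.4.5}$$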

Properties of $\{Z_t\}$:

(i) $\{Z_t\}$ is strictly stationary.

(ii) Moments:

$$EZ_t^r = E(e_t^r)\,E\exp(r\ell_t/2) = \begin{cases} 0, & \text{if } r \text{ is odd}, \\ \left[\prod_{i=1}^{m}(2i-1)\right]\exp\!\left(\dfrac{m\gamma_0}{1-\gamma_1} + \dfrac{m^2\sigma^2}{2(1-\gamma_1^2)}\right), & \text{if } r = 2m. \end{cases}$$

(iii) Kurtosis:

$$\frac{EZ_t^4}{(EZ_t^2)^2} = 3\exp\!\left(\frac{\sigma^2}{1-\gamma_1^2}\right) \ge 3.$$

Kurtosis (defined by the ratio on the left) is a standard measure of tail heaviness.
For a normally distributed random variable it has the value 3, so, as measured by
kurtosis, the tails of the marginal distribution of the lognormal SV process are
heavier than those of a normally distributed random variable.

(iv) The autocovariance function of $\{Z_t^2\}$:

We first observe that if $t > s$,

$$E(Z_t^2 Z_s^2 \mid e_u, \eta_u, u < t) = h_s h_t e_s^2\,E(e_t^2 \mid e_u, \eta_u, u < t) = h_s h_t e_s^2,$$

since $h_s$, $h_t$ and $e_s^2$ are each functions of $\{e_u, \eta_u, u < t\}$ and $e_t^2$ is independent of
$\{e_u, \eta_u, u < t\}$. Taking expectations on both sides of the last equation and using
the independence of $\{h_t\}$ and $\{e_t\}$ and the relation $h_t = \exp(\ell_t)$ gives

$$E(Z_t^2 Z_s^2) = E\exp(\ell_t + \ell_s).$$

Hence, for $h > 0$,

$$\mathrm{Cov}(Z_{t+h}^2, Z_t^2) = E\exp(\ell_{t+h} + \ell_t) - E\exp(\ell_{t+h})\,E\exp(\ell_t) = \exp[2\mu_\ell + v_\ell(1 + \gamma_1^h)] - \exp[2\mu_\ell + v_\ell].$$

Here we have used the facts that $\ell_{t+h} + \ell_t$ is normally distributed with mean and
variance which are easily computed from (7.2.17) and that for a normally
distributed random variable $X$ with mean $\mu$ and variance $v$, $E\exp(X) = \exp(\mu + v/2)$.
From (ii) we also have

$$\mathrm{Var}(Z_t^2) = EZ_t^4 - (EZ_t^2)^2 = 3\exp(2\mu_\ell + 2v_\ell) - \exp(2\mu_\ell + v_\ell).$$

Hence, for $h > 0$,

$$\rho_{Z_t^2}(h) = \frac{\mathrm{Cov}(Z_{t+h}^2, Z_t^2)}{\mathrm{Var}(Z_t^2)} = \frac{\exp(v_\ell\gamma_1^h) - 1}{3\exp(v_\ell) - 1} \approx \frac{v_\ell}{3\exp(v_\ell) - 1}\,\gamma_1^h,$$

suggesting the approximation of the autocorrelation function of $\{Z_t^2\}$ by that of
an ARMA(1,1) process. (Recall from Example 3.2.1 that the autocorrelation
function of an ARMA(1,1) process has the form $\rho(h) = c\,\phi^h$, $h \ge 1$, for some constant $c$, with
$\rho(0) = 1$.) There is a similarity here to the autocovariance function of the squared
GARCH(1,1) process which (see Problem 7.3) has the autocovariance function
of an ARMA(1,1) process.

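(v) The process $\{\ln Z_t^2\}$:

$$\ln Z_t^2 = \ell_t + \ln e_t^2. \tag{7.4.6}$$

If $e_t \sim$ N(0, 1) then $E\ln e_t^2 = -1.27$ and $\mathrm{Var}(\ln e_t^2) = 4.93$. From (7.4.6) we find
at once that $\mathrm{Var}(\ln Z_t^2) = v_\ell + 4.93$ and $\mathrm{Cov}(\ln Z_{t+h}^2, \ln Z_t^2) = v_\ell\gamma_1^{|h|}$ for $h \ne 0$. Hence
the process $\{\ln Z_t^2\}$ has the autocovariance function of an ARMA(1,1) process
with autocorrelation function

$$\rho_{\ln Z_t^2}(h) = \frac{v_\ell\gamma_1^{|h|}}{v_\ell + 4.93}, \quad h \ne 0.$$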
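
The moment formulas in properties (iii), (iv) and (v) are easy to evaluate for given parameter values. The sketch below uses hypothetical function names and takes the constant 4.93 for $\mathrm{Var}(\ln e_t^2)$ from the text.

```python
import math

def sv_mean_var(gamma0, gamma1, sigma2):
    """Mean (7.4.4) and variance (7.4.5) of the log-volatility process."""
    return gamma0 / (1 - gamma1), sigma2 / (1 - gamma1 ** 2)

def sv_kurtosis(gamma1, sigma2):
    """Kurtosis of Z_t from property (iii): 3 * exp(v_l)."""
    return 3 * math.exp(sigma2 / (1 - gamma1 ** 2))

def acf_squares(h, gamma1, sigma2):
    """Autocorrelation of {Z_t^2} at lag h > 0, from property (iv)."""
    v = sigma2 / (1 - gamma1 ** 2)
    return (math.exp(v * gamma1 ** h) - 1) / (3 * math.exp(v) - 1)

def acf_log_squares(h, gamma1, sigma2):
    """Autocorrelation of {ln Z_t^2} at lag h != 0, from property (v)."""
    v = sigma2 / (1 - gamma1 ** 2)
    return v * gamma1 ** abs(h) / (v + 4.93)
```

For instance, with $\gamma_1 = 0.9$ and $\sigma^2 = 0.19$ we have $v_\ell = 1$, so the kurtosis is $3e$ and the lag-one autocorrelation of $\{\ln Z_t^2\}$ is $0.9/5.93$.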


Estimation  for  the  lognormal  SV  model 

The parameters to be estimated in the defining equations (7.4.2) and (7.4.3) are $\sigma^2$, $\gamma_0$
and $\gamma_1$. They can be estimated by maximization of the Gaussian likelihood which can
be calculated, for any specified values of the parameters, as follows.

By Property (v) above, the process $\{Y_t := \ln Z_t^2 - E\ln Z_t^2\}$ satisfies the ARMA(1,1)
equations,

$$Y_t - \phi Y_{t-1} = Z_t + \theta Z_{t-1}, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma_Z^2), \tag{7.4.7}$$

for some coefficients $\phi$ and $\theta$ in the interval $(-1, 1)$ and white-noise variance $\sigma_Z^2$
(on the right-hand side of (7.4.7), $\{Z_t\}$ denotes the ARMA white-noise sequence, not the return series).
Comparing the autocorrelation function of (7.4.7) with the autocorrelation function of
$\{\ln Z_t^2\}$ given above in Property (v), we find that

$$\gamma_1 = \phi \tag{7.4.8}$$

and

$$\frac{\gamma_1 v_\ell}{v_\ell + 4.93} = \frac{(\phi + \theta)(1 + \theta\phi)}{1 + 2\theta\phi + \theta^2}. \tag{7.4.9}$$

To ensure that the right-hand side falls in the interval (0, 1) it is necessary and sufficient
(assuming that $\phi \in (-1, 1)$ and $\theta \in (-1, 1)$) that $\phi + \theta > 0$. The maximum
Gaussian likelihood estimators $\hat\phi$ and $\hat\theta$ can be found using the program ITSM and
the corresponding estimators $\hat\gamma_1$ and $\hat v_\ell$ on replacing $\phi$ and $\theta$ by their estimators in
(7.4.8) and (7.4.9) respectively. From (7.4.5) the corresponding estimator of $\sigma^2$ is

$$\hat\sigma^2 = (1 - \hat\gamma_1^2)\hat v_\ell,$$

where $\hat\gamma_1 = \hat\phi$ and, from (7.4.4) and (7.4.6), the corresponding estimator of $\gamma_0$ is

$$\hat\gamma_0 = (1 - \hat\gamma_1)\left(\overline{\ln Z^2} + 1.27\right),$$

where $\overline{\ln Z^2}$ denotes the sample mean of the observations of $\ln Z_t^2$. If it turns out that
the estimators $\hat\phi$ and $\hat\theta$ satisfy $\hat\phi + \hat\theta \le 0$ then, from (7.4.9),
$\hat\gamma_1\hat v_\ell/(\hat v_\ell + 4.93) \le 0$, suggesting that
the lognormal SV model is not appropriate in this case.

Forecasting  the  log  volatility 

The minimum mean-squared error predictor of $\ell_{t+h}$ conditional on $\{\ell_s, s \le t\}$ is
easily found from (7.4.3) to be

$$P_t\ell_{t+h} = \gamma_1^h\ell_t + \gamma_0\,\frac{1 - \gamma_1^h}{1 - \gamma_1}, \tag{7.4.10}$$

with mean-squared error,

$$E(\ell_{t+h} - P_t\ell_{t+h})^2 = \sigma^2\,\frac{1 - \gamma_1^{2h}}{1 - \gamma_1^2}. \tag{7.4.11}$$

We have seen how to estimate $\gamma_0$, $\gamma_1$ and $\sigma^2$, but unfortunately $\ell_t$ is not observed.
In order to forecast $\ell_{t+h}$ using the observations $\{Z_s, s \le t\}$, we can however use the
Kalman recursions as described in Section 9.4, Example 9.4.2.
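
The predictor (7.4.10) and its mean-squared error (7.4.11) can be sketched as follows, for the idealized situation in which $\ell_t$ is known. The function name is hypothetical; in practice $\ell_t$ would have to be replaced by an estimate, for example one obtained from the Kalman recursions mentioned above.

```python
def predict_log_vol(l_t, h, gamma0, gamma1, sigma2):
    """h-step predictor of l_{t+h} given l_t, and its mean-squared error.

    Implements (7.4.10) and (7.4.11) for the AR(1) log-volatility (7.4.3).
    """
    pred = gamma1 ** h * l_t + gamma0 * (1 - gamma1 ** h) / (1 - gamma1)
    mse = sigma2 * (1 - gamma1 ** (2 * h)) / (1 - gamma1 ** 2)
    return pred, mse
```

As $h$ increases the predictor tends to the mean $\gamma_0/(1-\gamma_1)$ of (7.4.4) and the mean-squared error tends to the variance $\sigma^2/(1-\gamma_1^2)$ of (7.4.5).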


7.5  Continuous-Time  Models 

7.5.1  Levy  Processes 

Continuous-time  models  for  asset  prices  have  a  long  history,  going  back  to  Bachelier 
(1900)  who  used  Brownian  motion  to  represent  the  movement  of  asset  prices  in 
the  Paris  stock  exchange.  Continuous-time  models  have  since  moved  to  a  central 
place in mathematical finance, largely because of their use in the field of option
pricing, initiated by the Nobel-Prize-winning work of Black, Scholes and Merton, and
partly also because of the current availability of high-frequency and irregularly-spaced
transaction data which are represented most naturally by continuous-time models.

We  earlier  defined  the  daily  return  on  day  t  of  a  stock  whose  closing  price  is  Pt  as 

Zt  =  Xt-Xt_u  (7.5.1) 

where 

=  logP,  (7.5.2) 

is  the  log  asset  price  at  the  close  of  day  t.  If  the  daily  returns  were  iid  this  would  mean 
that  the  process  {XJ  is  a  random  walk  (Example  1.4.3).  This  is  an  over-simplified 
model  for  daily  asset  prices  as  there  is  very  strong  evidence  suggesting  that  the  daily 
returns,  although  exhibiting  little  or  no  autocorrelation,  are  not  independent. 
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Definition  7.5.1 


Example  7.5.1 


Example  7.5.2 


Nevertheless  it  will  be  a  useful  starting  point,  in  the  construction  of  continuous¬ 
time  models  to  introduce  the  continuous-time  analogue  of  a  random  walk,  known  as  a 
Levy  process.  Like  iid  noise  in  discrete  time,  it  is  the  building  block  for  the  construction 
of  a  large  family  of  more  complex  models  for  financial  data. 


A  Levy  process,  |L(t),  t  e  R}  is  a  process  with  the  following  properties: 

(i)  L(  0)  =  0. 

(ii)  L(t )  —  L(s )  has  the  same  distribution  as  L(t  —  s )  for  all  s  and  t  such  that  s  <  t. 

(iii)  If  (s,  t)  and  (w,  v)  are  disjoint  intervals  then  L{t)  —  L(s)  and  L(v)  —  L{u)  are 
independent. 

(v)  [L(t)}  is  continuous  in  probability,  i.e.  for  all  6  >  0  and  for  all  t  e  R, 

limP(|L(t)  —  L(s)|  >  c)  =  0. 


The  essential  properties  of  Levy  processes  are  discussed  in  Appendix  D.  For  thorough 
accounts  of  Levy  processes  and  their  properties  see  the  books  of  Applebaum  (2004), 
Protter  (2010)  and  Sato  (1999)  and  for  an  extensive  account  of  their  applications  to 
finance  see  Schoutens  (2003)  and  Andersen  et  al.  (2009).  For  now  we  restrict  attention 
to  two  of  the  most  familiar  examples  of  Levy  processes,  Brownian  motion,  whose 
sample-paths  are  continuous,  and  the  compound  Poisson  process,  whose  sample-paths 
are  constant  except  for  jumps. 

Example 7.5.1  Brownian Motion

This is a Lévy process for which $L(t) \sim \mathrm{N}(\mu t, \sigma^2 t)$, $t \ge 0$, with parameters $\mu \in \mathbb{R}$
and $\sigma > 0$. The sample-paths are continuous and the characteristic function of $L(t)$
for $t > 0$ is

$$E e^{i\theta L(t)} = e^{t\xi(\theta)}, \quad \theta \in \mathbb{R}, \tag{7.5.3}$$

where

$$\xi(\theta) = i\theta\mu - \theta^2\sigma^2/2.$$

The defining properties (ii) and (iii) imply that for any finite collection of times $t_1 < t_2 < \cdots < t_n$,
the increments $\Delta_i := L(t_{i+1}) - L(t_i)$, $i = 1, \ldots, n-1$, are independent
random variables satisfying $\Delta_i \sim \mathrm{N}(\mu(t_{i+1} - t_i), \sigma^2(t_{i+1} - t_i))$. Brownian motion
with $\mu = 0$ and $\sigma = 1$ is known as standard Brownian motion. We shall denote it
henceforth as $\{B(t),\ t \in \mathbb{R}\}$. A realization of $B(t)$, $0 \le t \le 10$, is shown in Figure 7-5.

□ 
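The increment property just described suggests a simple way to generate a path on a grid. The following Python sketch uses only the standard library; the function name `simulate_brownian_motion`, the grid size and the seed are illustrative choices of ours and are not taken from the text or from ITSM2000. It cumulates independent $\mathrm{N}(\mu h, \sigma^2 h)$ increments over steps of length $h$.

```python
import math
import random


def simulate_brownian_motion(mu, sigma, t_max, n_steps, seed=None):
    """Return (times, values) of a Brownian motion path on [0, t_max].

    Each increment over a step of length h = t_max / n_steps is drawn
    independently from N(mu * h, sigma**2 * h), as implied by properties
    (ii) and (iii) of a Levy process with Gaussian marginals.
    """
    rng = random.Random(seed)
    h = t_max / n_steps
    times = [0.0]
    values = [0.0]  # property (i): L(0) = 0
    for i in range(1, n_steps + 1):
        increment = rng.gauss(mu * h, sigma * math.sqrt(h))
        times.append(i * h)
        values.append(values[-1] + increment)
    return times, values


# Standard Brownian motion B(t), 0 <= t <= 10, on an (assumed) grid of 1000 steps.
times, path = simulate_brownian_motion(mu=0.0, sigma=1.0, t_max=10.0,
                                       n_steps=1000, seed=1)
```

The grid values have exactly the joint distribution of the process at the grid times; only the behaviour between grid points is left unspecified by such a simulation.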


Example 7.5.2  The Poisson Process

The Poisson process $\{N(t),\ t \in \mathbb{R}\}$ with intensity or jump-rate $\lambda$ is a Lévy process such
that $N(t)$, for $t \ge 0$, has the Poisson distribution with mean $\lambda t$. Its sample paths are
right-continuous functions which are constant except for jumps of size 1, the number
of jumps occurring in any time interval of length $\ell$ having the Poisson distribution with
mean $\lambda\ell$. The characteristic function of $N(t)$ for $t > 0$ is given by (7.5.3), with $N$ in place of $L$ and

$$\xi(\theta) = \lambda(e^{i\theta} - 1).$$


Figure 7-5  A realization of standard Brownian motion $B(t)$, $0 \le t \le 10$

Figure 7-6  A realization of a Poisson process $N(t)$, $0 \le t \le 10$, with jump-rate 5 per unit time

A sample-path of a Poisson process with $\lambda = 5$ on the time-interval $[0, 10]$ is shown
in  Figure  7-6. 

□ 

Example  7.5.3  The  Compound  Poisson  Process 

The compound Poisson process $\{X(t),\ t \in \mathbb{R}\}$ with jump-rate $\lambda$ and jump-size
distribution function $F$ is a Lévy process with sample-paths which are constant except
for jumps. The jump-times are those of a Poisson process $\{N(t)\}$ with jump-rate $\lambda$ and
the sizes of the jumps are independent random variables, independent of the process
$\{N(t)\}$, with a distribution function $F$ assigning probability zero to the value zero. The
characteristic function of $X(t)$ for $t > 0$ is again given by (7.5.3), with $X$ in place of $L$, but now with

$$\xi(\theta) = i\theta c + \int_{\mathbb{R}} \left(e^{i\theta x} - 1 - i\theta x I_{(-1,1)}(x)\right)\lambda\, dF(x), \tag{7.5.4}$$

where $c = \lambda \int_{|x|<1} x\, dF(x)$ and $I_{(-1,1)}(x) = 1$ if $|x| < 1$ and zero otherwise.
A realization of a compound Poisson process on the interval $[0, 10]$ is shown in
Figure 7-7.

□ 
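Examples 7.5.2 and 7.5.3 can both be simulated from the jump-times and jump-sizes. The Python sketch below (standard library only) relies on the standard fact, not stated above, that the waiting times between successive jumps of a Poisson process with rate $\lambda$ are independent exponential random variables with mean $1/\lambda$. The function name `simulate_compound_poisson` and the `jump_sampler` argument are hypothetical names introduced for illustration.

```python
import random


def simulate_compound_poisson(rate, jump_sampler, t_max, seed=None):
    """Return (jump_times, values) for a compound Poisson path on [0, t_max].

    jump_times[k] is the time of the k-th jump and values[k] is the value of
    the process immediately after that jump. The path starts at 0 and is
    constant between jumps.
    """
    rng = random.Random(seed)
    jump_times, values = [], []
    t, x = 0.0, 0.0
    while True:
        t += rng.expovariate(rate)  # exponential waiting time, mean 1/rate
        if t > t_max:
            break
        x += jump_sampler(rng)      # independent jump size drawn from F
        jump_times.append(t)
        values.append(x)
    return jump_times, values


# Poisson process with jump-rate 5: every jump has size 1.
n_times, n_values = simulate_compound_poisson(5.0, lambda rng: 1.0, 10.0, seed=2)

# Compound Poisson process with jump-rate 5 and standard normal jump sizes.
x_times, x_values = simulate_compound_poisson(
    5.0, lambda rng: rng.gauss(0.0, 1.0), 10.0, seed=2)
```

Passing a sampler that returns only positive values (for example exponential jump sizes) gives a path that never decreases.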

The  above  examples  give  some  idea  of  the  immense  variety  in  the  class  of  Levy 
processes.  The  Levy-Ito  decomposition  implies  that  every  Levy  process  L  can  be 
expressed  as  the  sum  of  a  Brownian  motion  and  an  independent  pure-jump  process. 
The  marginal  distribution  of  L(t)  can  be  any  distribution  from  the  class  of  infinitely 
divisible  distributions  (which  includes  the  gamma,  Gaussian,  Student’s  t,  stable, 
compound  Poisson  and  many  additional  well-known  distributions).  See  Appendix  D 
and  the  references  given  there  for  more  details. 
Figure 7-7  A realization of a compound Poisson process $X(t)$, $0 \le t \le 10$, with jump-rate 5 per unit time and jump-size distribution normal with mean 0 and variance 1


7.5.2  The  Geometric  Brownian  Motion  (GBM)  Model  for  Asset  Prices 

In  his  pioneering  mathematical  analysis  of  stock  prices,  contained  in  his  doctoral 
thesis, Théorie de la spéculation, Bachelier (1900) introduced a model in which
the price of an asset $\{P(t)\}$ is Brownian motion with parameters $\mu$ and $\sigma$ (see
Example 7.5.1). Measuring time in units of 1 day, this implies in particular that the
daily closing prices, $P(t)$, $t = 0, 1, 2, \ldots$, constitute a random walk with increments
$P(t) - P(t-1)$ which are independent and normally distributed with mean $\mu$ and
variance $\sigma^2$. The normality of these increments and the fact that $P(t)$ takes negative
values with positive probability clearly limit the value of this model as a realistic
approximation to observed daily prices. However, interest in the work of Bachelier
and his use of the Brownian motion model to solve problems in mathematical finance
led Samuelson (1965) to develop and apply the more realistic geometric Brownian
motion  model  for  asset  prices.  A  fascinating  account  of  Bachelier’s  work,  including 
an  English  translation  of  his  thesis  and  comments  on  its  place  in  the  history  of  both 
probability  theory  and  mathematical  finance  is  contained  in  the  book  of  Davis  and 
Etheridge  (2006).  The  geometric  Brownian  motion  model  is  the  one  for  which  the 
celebrated  option-pricing  formulae  of  Black,  Scholes  and  Merton  were  first  derived. 

In the Brownian motion model the asset price $\{P(t),\ t \ge 0\}$ satisfies the stochastic
differential equation,

$$dP(t) = \mu\, dt + \sigma\, dB(t), \tag{7.5.5}$$

where $\{B(t)\}$ is standard Brownian motion, i.e., Brownian motion with $E B(t) = 0$ and
$\mathrm{Var}\, B(t) = t$, $t \ge 0$. Equation (7.5.5) is shorthand for the integrated form,

$$P(t) - P(0) = \mu t + \sigma B(t).$$

In addition to the obvious flaw that $P(t)$ will take negative values for some values
of $t$, the increments $P(t) - P(t-1)$ are normally distributed, while in practice it is
observed  that  these  increments  have  marginal  distributions  with  heavier  tails  than  the 
normal  distribution.  The  geometric  Brownian  motion  model  addresses  both  of  these 
shortcomings. 

The geometric Brownian motion model for $\{P(t),\ t \ge 0\}$ is defined by the Itô
stochastic differential equation,

$$dP(t) = P(t)\,[\mu\, dt + \sigma\, dB(t)], \quad \text{with } P(0) > 0. \tag{7.5.6}$$

Solution  of  this  equation  requires  knowledge  of  Ito  calculus,  a  brief  introduction  to 
which  is  given  in  Appendix  D.  A  more  extensive  and  very  readable  account  with 
financial  applications  can  be  found  in  the  book  of  Mikosch  (1998).  The  solution  of 
(7.5.6)  satisfies  (see  Appendix  D) 

$$P(t) = P(0)\exp\left[\left(\mu - \frac{\sigma^2}{2}\right)t + \sigma B(t)\right], \tag{7.5.7}$$

from which it follows at once that the log asset price $X(t) = \log P(t)$ satisfies

$$X(t) = X(0) + \left(\mu - \frac{\sigma^2}{2}\right)t + \sigma B(t), \tag{7.5.8}$$

or equivalently

$$dX(t) = \left(\mu - \frac{\sigma^2}{2}\right)dt + \sigma\, dB(t). \tag{7.5.9}$$

A realization of the process $P(t)$, $0 \le t \le 10$, with $P(0) = 1$, $\mu = 0$ and $\sigma = 0.01$ is
shown in Figure 7-8.

Figure 7-8  A realization of GBM, $P(t)$, $0 \le t \le 10$, with $P(0) = 1$, $\mu = 0$ and $\sigma = 0.01$
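Because (7.5.7) gives the solution in closed form, a GBM path can be generated on a grid without discretization error by exponentiating a Brownian motion with drift. The Python sketch below (standard library only) does this; the function name `simulate_gbm` and the grid size are illustrative assumptions of ours.

```python
import math
import random


def simulate_gbm(p0, mu, sigma, t_max, n_steps, seed=None):
    """Return GBM prices at times 0, h, 2h, ..., t_max, where h = t_max / n_steps.

    By (7.5.8) the log price has independent Gaussian increments with mean
    (mu - sigma**2 / 2) * h and variance sigma**2 * h over each step.
    """
    rng = random.Random(seed)
    h = t_max / n_steps
    drift = (mu - 0.5 * sigma ** 2) * h
    log_price = math.log(p0)
    prices = [p0]
    for _ in range(n_steps):
        log_price += drift + sigma * math.sqrt(h) * rng.gauss(0.0, 1.0)
        prices.append(math.exp(log_price))
    return prices


# Same parameter values as quoted for Figure 7-8 (a different random path).
prices = simulate_gbm(p0=1.0, mu=0.0, sigma=0.01, t_max=10.0, n_steps=1000, seed=3)
```

Every simulated price is an exponential and hence strictly positive, in contrast with the Brownian motion model (7.5.5).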

The return for the time interval $(t - \Delta, t)$ is

$$Z_\Delta(t) = X(t) - X(t - \Delta) = \left(\mu - \frac{\sigma^2}{2}\right)\Delta + \sigma\,[B(t) - B(t - \Delta)]. \tag{7.5.10}$$

For disjoint intervals of length $\Delta$ the returns are therefore independent normally
distributed random variables with mean $(\mu - \sigma^2/2)\Delta$ and variance $\sigma^2\Delta$. The normality
of the returns implied by this model is a property which can easily be checked against
observed returns. It is found from empirically observed returns that the deviations from
normality are substantial for time intervals of the order of a day or less, becoming less
apparent as $\Delta$ increases. This is one of the reasons for developing the more complex
models  described  in  later  sections. 


Remark  1.  An  asset-price  model  which  overcomes  the  normality  constraint  is  the  so- 
called  Levy  market  model  (LMM),  in  which  the  log  asset  price  X  is  assumed  to  be  a 
Levy  process,  not  necessarily  Brownian  motion  as  in  the  GBM  model.  For  a  discussion 
of  such  models  see  Eberlein  (2009). 

The parameter $\sigma^2$ in the GBM model is called the volatility parameter. It plays
a key role in the option pricing analysis of Black and Scholes (1973) and Merton
(1973) to be discussed in Section 7.6. Although $\sigma^2$ cannot be determined from discrete
observations of a GBM process it can be estimated from closely-spaced discrete
observations $X(i/N)$, $i = 1, \ldots, N$, with large $N$, as described in the following
paragraph.

From (7.5.8) we can write

$$(\Delta_i X)^2 := [X(i/N) - X((i-1)/N)]^2 = (c/N + \sigma\Delta_i B)^2, \tag{7.5.11}$$

where $\Delta_i B = B(i/N) - B((i-1)/N)$ and $c = \mu - \sigma^2/2$. A simple calculation then
gives

$$E[(\Delta_i X)^2] = \frac{\sigma^2}{N} + \frac{c^2}{N^2},$$

and

$$\mathrm{Var}[(\Delta_i X)^2] = \frac{4\sigma^2 c^2}{N^3} + \frac{2\sigma^4}{N^2}.$$

By the independence of the summands, $\sum_{i=1}^N (\Delta_i X)^2$ has mean $\sigma^2 + c^2/N$ and variance
$2\sigma^4/N + 4\sigma^2 c^2/N^2$, showing that, as $N \to \infty$,

$$\sum_{i=1}^N (\Delta_i X)^2 \xrightarrow{\text{m.s.}} \sigma^2 = \int_0^1 \sigma^2\, dt. \tag{7.5.12}$$

This calculation shows that, for the GBM process, the sum on the left is a consistent
estimator of $\sigma^2$ as $N \to \infty$. The sum (for suitably large $N$) is known as the realized
volatility for the time interval $[0, 1]$ and the integral on the right is known as the
integrated volatility for the same interval; $\sigma^2$ itself is known as the spot volatility.
The  realized  volatility  is  widely  used  as  an  estimator  of  the  integrated  volatility  and 
is  consistent  for  a  wide  class  of  models  in  which  the  spot  volatility  is  not  necessarily 
constant  as  it  is  in  the  GBM  model.  For  a  discussion  of  realized  volatility  in  a  more 
general  context  see  the  article  of  Andersen  and  Benzoni  (2009). 
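The convergence in (7.5.12) is easy to observe numerically. The Python sketch below (standard library only) simulates the log price of a GBM at the times $i/N$ and computes the sum of squared increments; the function names and the parameter values $\mu = 0.05$, $\sigma = 0.2$ are illustrative assumptions of ours, not values from the text.

```python
import math
import random


def simulate_log_gbm_unit_interval(mu, sigma, n, seed=None):
    """Return X(0), X(1/n), ..., X(1) for the log price (7.5.8) with X(0) = 0."""
    rng = random.Random(seed)
    c = mu - 0.5 * sigma ** 2
    x = [0.0]
    for _ in range(n):
        # increment c/N + sigma * Delta_i B, where Delta_i B ~ N(0, 1/N)
        x.append(x[-1] + c / n + sigma * rng.gauss(0.0, 1.0) / math.sqrt(n))
    return x


def realized_volatility(log_prices):
    """Sum of squared increments of the observed log prices, as in (7.5.12)."""
    return sum((b - a) ** 2 for a, b in zip(log_prices, log_prices[1:]))


# With sigma = 0.2 the target is sigma**2 = 0.04; larger N gives a tighter estimate.
for n in (100, 1000, 20000):
    x = simulate_log_gbm_unit_interval(mu=0.05, sigma=0.2, n=n, seed=4)
    print(n, realized_volatility(x))
```

From the variance formula above, the standard deviation of the estimate is close to $\sigma^2\sqrt{2/N}$, which for $N = 20000$ is about 1% of $\sigma^2$.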

We shall denote the realized volatility, computed for day $n$, $n = 1, 2, 3, \ldots$, by $\sigma_n^2$.
It is found in practice to vary significantly from one day to the next. The sequence $\{\sigma_n^2\}$ of
realized  volatilities  exhibits  clustering,  i.e.,  periods  of  low  values  interrupted  by  bursts 
of  large  values,  and  has  the  appearance  of  a  positively  correlated  stationary  sequence, 
reinforcing  the  view  that  volatility  is  not  constant  as  in  the  GBM  model  and  suggesting 
the  need  for  a  model  in  which  volatility  is  stochastic.  Such  observations  are  precisely 
those  which  led  to  the  development  in  discrete  time  of  stochastic  volatility,  ARCH, 
and  GARCH  models,  and  suggest  the  need  for  analogous  models  with  continuous  time 
parameter. 


7.5.3  A  Continuous-Time  SV  Model 

In the discrete-time modeling of asset prices we have seen how both the GARCH
and SV models allow for the variation of the volatility with time by modeling $\{h_t\}$ as a
random process. A continuous-time analogue of this idea was introduced by Barndorff-Nielsen
and Shephard (2001) in their celebrated continuous-time SV model for the log
asset price $X(t)$ [cf. (7.5.9)],

$$dX(t) = [m + b\,h(t)]\,dt + \sqrt{h(t)}\,dB(t), \quad t \ge 0, \quad \text{with } X(0) = 0, \tag{7.5.13}$$

where $m \in \mathbb{R}$, $b \in \mathbb{R}$, $\{B(t)\}$ is standard Brownian motion and $\{h(t)\}$ is a
stationary subordinator-driven Ornstein-Uhlenbeck process independent of $\{B(t)\}$.
The connection with discrete-time SV models is clear if we set $m = b = 0$ in (7.5.13)
and compare with (7.4.1). Notice also that (7.5.13) has the same form as the GBM
equation (7.5.9) except that the constant volatility parameter $\sigma^2$ has been replaced by
the  random  volatility  h(t). 

A  subordinator  is  a  Levy  process  with  non-decreasing  sample  paths.  The  simplest 
example  of  a  subordinator  is  the  Poisson  process  of  Example  7.5.2.  If  the  compound 
Poisson  process  in  Example  7.5.3  has  non-negative  jumps,  i.e.,  if  the  jump-size 
distribution function $F$ satisfies $F(0) = 0$, then it too is a subordinator. Other examples
of  subordinators  are  the  gamma  process  (see  Appendix  D),  whose  increments  on 
disjoint  intervals  have  a  gamma  distribution,  and  the  stable  subordinators,  whose 
increments  on  disjoint  intervals  are  independent  non-negative  stable  random  variables. 

An Ornstein-Uhlenbeck process driven by the subordinator $L$ satisfies the stochastic
differential equation,

$$dh(t) = \lambda h(t)\,dt + dL(t), \quad t \in \mathbb{R}, \tag{7.5.14}$$

where $\lambda < 0$. If $E L(1)^r < \infty$ for some $r > 0$ this equation has a unique strictly
stationary causal solution

$$h(t) = \int_{-\infty}^{t} e^{\lambda(t-u)}\,dL(u). \tag{7.5.15}$$

(Causal here means that $h(t)$ is independent of the increments $\{L(u) - L(t) : u > t\}$ for
every $t$.) A crucial feature of (7.5.15) is the non-negativity of $h(t)$ which follows from
the non-decreasing sample-paths of the subordinator $\{L(t)\}$ and the non-negativity of
the  integrand.  Non-negativity  is  clearly  a  necessary  property  if  h(t)  is  to  represent 
volatility.  For  a  detailed  account  of  Levy-driven  stochastic  differential  equations  and 
integrals  with  respect  to  Levy  processes,  see  Protter  (2010).  In  the  case  when  L  is  a 
subordinator,  (7.5.15)  has  the  very  simple  interpretation  as  a  pathwise  integral  with 
respect  to  the  non-decreasing  sample-path  of  L. 
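The pathwise interpretation makes simulation straightforward when $L$ is a compound Poisson subordinator: between jumps $h$ decays exponentially at rate $|\lambda|$, and each jump of $L$ adds its size to $h$. The Python sketch below (standard library only) uses this to update $h$ exactly from one grid point to the next. The exponential jump-size distribution, the function name `simulate_ou_volatility` and the choice of starting value `h0` (rather than a draw from the stationary distribution) are illustrative assumptions of ours.

```python
import math
import random


def simulate_ou_volatility(lam, rate, mean_jump, h0, t_max, n_steps, seed=None):
    """Simulate dh(t) = lam * h(t) dt + dL(t) on an equally spaced grid.

    L is taken to be a compound Poisson process with jump-rate `rate` and
    exponentially distributed (hence positive) jump sizes with mean
    `mean_jump`, so L is a subordinator. lam must be negative.
    """
    if lam >= 0:
        raise ValueError("lam must be negative")
    rng = random.Random(seed)
    dt = t_max / n_steps
    decay = math.exp(lam * dt)
    h = [h0]
    next_jump = rng.expovariate(rate)
    for k in range(1, n_steps + 1):
        t_end = k * dt
        value = decay * h[-1]  # exponential decay of the previous value
        while next_jump <= t_end:
            size = rng.expovariate(1.0 / mean_jump)
            # a jump at time u contributes exp(lam * (t_end - u)) * size at t_end
            value += math.exp(lam * (t_end - next_jump)) * size
            next_jump += rng.expovariate(rate)
        h.append(value)
    return h


h_path = simulate_ou_volatility(lam=-1.0, rate=5.0, mean_jump=0.2, h0=1.0,
                                t_max=2000.0, n_steps=20000, seed=5)
```

With a non-negative starting value the simulated path is non-negative at every grid point, since it is a sum of non-negative terms.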

Quantities associated with the model (7.5.13) which are of particular interest are
the returns over time intervals of length $\Delta > 0$, i.e.,

$$Y_n := X(n\Delta) - X((n-1)\Delta), \quad n \in \mathbb{N},$$

and the integrated volatilities,

$$I_n = \int_{(n-1)\Delta}^{n\Delta} h(t)\,dt, \quad n \in \mathbb{N}.$$

The interval $\Delta$ is frequently one trading day. The return for the day is an observable
quantity and the integrated volatility, although not directly observable, can
be estimated from high-frequency within-day observations of $X(t)$, as discussed in
Section 7.5.2 for the GBM model.

For the model (7.5.13) with any second-order stationary non-negative volatility
process $h$ which is independent of $B$ and has the properties,

$$E h(t) = \xi, \quad \mathrm{Var}(h(t)) = \omega^2$$

and

$$\mathrm{Cov}(h(t), h(t+s)) = \omega^2\rho(s), \quad s \in \mathbb{R},$$

it can be shown (Problem 7.8) that the stationary sequence $\{I_n\}$ has mean,

$$E I_n = \xi\Delta, \tag{7.5.16}$$

and autocovariance function,

$$\gamma_I(k) = \begin{cases} 2\omega^2 r(\Delta), & \text{if } k = 0, \\ \omega^2\,[r((k+1)\Delta) - 2r(k\Delta) + r((k-1)\Delta)], & \text{if } k \ge 1, \end{cases} \tag{7.5.17}$$

where

$$r(t) := \int_0^t \int_0^y \rho(u)\,du\,dy. \tag{7.5.18}$$

The stationary sequence of log returns $\{Y_n\}$ has mean $(m + b\xi)\Delta$ and autocovariance
function,

$$\gamma_Y(k) = \begin{cases} b^2\gamma_I(0) + \xi\Delta, & \text{if } k = 0, \\ b^2\gamma_I(k), & \text{if } k \ge 1. \end{cases} \tag{7.5.19}$$

If in addition $m = b = 0$ then the log returns $\{Y_n\}$ are uncorrelated while the squared
sequence $\{Y_n^2\}$ (see Problem 7.11) has mean,

$$E Y_n^2 = \xi\Delta \tag{7.5.20}$$

and autocovariance function,

$$\gamma_{Y^2}(k) = \begin{cases} \omega^2\,[6r(\Delta) + 2\Delta^2\xi^2/\omega^2], & \text{if } k = 0, \\ \omega^2\,[r((k+1)\Delta) - 2r(k\Delta) + r((k-1)\Delta)], & \text{if } k \ge 1. \end{cases} \tag{7.5.21}$$

Thus, under these assumptions, the log returns, $Y_n$, calculated from the model are
uncorrelated while the squares, $Y_n^2$, are correlated, showing that the log returns are
uncorrelated but not independent, in keeping with the “stylized facts” associated with
empirically observed log returns.


Example 7.5.4  The Ornstein-Uhlenbeck SV Model with $m = b = 0$


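We can use the results (7.5.16)-(7.5.21) to determine properties of the sequences $\{Y_n\}$,
$\{Y_n^2\}$ and $\{I_n\}$ associated with the Ornstein-Uhlenbeck SV model,

$$dX(t) = \sqrt{h(t)}\,dB(t), \quad t \ge 0, \quad \text{with } X(0) = 0, \tag{7.5.22}$$

where

$$h(t) = \int_{-\infty}^{t} e^{\lambda(t-u)}\,dL(u), \tag{7.5.23}$$

$\lambda < 0$ and $E L(1)^2 < \infty$.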

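In order to apply (7.5.16)-(7.5.21) we need to determine $\xi = E h(t)$, $\omega^2 = \mathrm{Var}(h(t))$
and the autocorrelation function $\rho$ of $h$. To this end we rewrite (7.5.23) as

$$h(t) = \int_{-\infty}^{\infty} g(t-u)\,dL(u), \tag{7.5.24}$$

where

$$g(x) = \begin{cases} e^{\lambda x}, & \text{if } x > 0, \\ 0, & \text{otherwise.} \end{cases} \tag{7.5.25}$$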


The function $g$ in the representation (7.5.24) is called a kernel function. If $E L(1)^2 < \infty$,
as we shall assume from now on, and if $f$ and $g$ are integrable and square-integrable
functions on $\mathbb{R}$, we have (see Appendix D),

$$E\int_{-\infty}^{\infty} f(t-u)\,dL(u) = \mu\int_{-\infty}^{\infty} f(u)\,du \tag{7.5.26}$$

and

$$\mathrm{Cov}\left(\int_{-\infty}^{\infty} f(t-u)\,dL(u),\ \int_{-\infty}^{\infty} g(t-u)\,dL(u)\right) = \sigma^2\int_{-\infty}^{\infty} f(u)g(u)\,du, \tag{7.5.27}$$

where $\mu = E L(1)$ and $\sigma^2 = \mathrm{Var}(L(1))$. Taking $g$ as in (7.5.25) and $f(x) = g(s + x)$, $x \in \mathbb{R}$,
we find from these equations that the mean and autocovariance function of the
volatility process $\{h(t)\}$ defined by (7.5.23) are given by

$$\xi = E h(t) = \frac{\mu}{|\lambda|}$$

and

$$\mathrm{Cov}(h(t+s), h(t)) = \frac{\sigma^2}{2|\lambda|}e^{\lambda s} = \omega^2\rho(s), \quad s \ge 0,$$

where $\omega^2 = \mathrm{Var}(h(t)) = \sigma^2/(2|\lambda|)$ and $\rho(s) = e^{\lambda s}$, $s \ge 0$. Substituting for $\rho$ into
(7.5.18) gives

$$r(t) = \lambda^{-2}\left(e^{\lambda t} - 1 - \lambda t\right).$$

We can now substitute for $\xi$, $\omega^2$, $\rho$ and $r$ in equations (7.5.16)-(7.5.21) to get the
second-order properties of the sequences $\{Y_n\}$, $\{Y_n^2\}$ and $\{I_n\}$. In particular we find that

$$\{Y_n\} \sim \mathrm{WN}(0, |\lambda|^{-1}\mu\Delta),$$

$$E Y_n^2 = E I_n = |\lambda|^{-1}\mu\Delta,$$

and

$$\gamma_{Y^2}(k) = \gamma_I(k) = \tfrac{1}{2}|\lambda|^{-3}\sigma^2 e^{(k-1)\lambda\Delta}(1 - e^{\lambda\Delta})^2, \quad k \ge 1.$$

The validity of the latter expressions for $k \ge 1$ and not for $k = 0$ indicates that
both the squared return sequence $\{Y_n^2\}$ and the integrated volatility sequence $\{I_n\}$ have
the autocovariances of ARMA(1, 1) processes. This demonstrates, for this particular
model, the covariance structure of the sequence $\{Y_n^2\}$ and the consequent dependence
of the white-noise returns sequence $\{Y_n\}$.

□ 
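The closed-form autocovariance in Example 7.5.4 can be checked numerically against the general formula (7.5.17). The Python sketch below (standard library only) evaluates both, using $r(t) = \lambda^{-2}(e^{\lambda t} - 1 - \lambda t)$ and $\omega^2 = \sigma^2/(2|\lambda|)$. The function names and the parameter values $\lambda = -0.5$, $\sigma^2 = 2$, $\Delta = 1$ are illustrative assumptions of ours.

```python
import math


def r_ou(t, lam):
    """r(t) from (7.5.18) when rho(s) = exp(lam * s), s >= 0."""
    return (math.exp(lam * t) - 1.0 - lam * t) / lam ** 2


def gamma_I(k, lam, sigma2, delta):
    """Autocovariance of the integrated volatilities from (7.5.17)."""
    omega2 = sigma2 / (2.0 * abs(lam))
    if k == 0:
        return 2.0 * omega2 * r_ou(delta, lam)
    return omega2 * (r_ou((k + 1) * delta, lam) - 2.0 * r_ou(k * delta, lam)
                     + r_ou((k - 1) * delta, lam))


def gamma_I_closed_form(k, lam, sigma2, delta):
    """Closed form for lags k >= 1 quoted in Example 7.5.4."""
    return (0.5 * abs(lam) ** -3 * sigma2 * math.exp((k - 1) * lam * delta)
            * (1.0 - math.exp(lam * delta)) ** 2)


lam, sigma2, delta = -0.5, 2.0, 1.0
for k in range(1, 6):
    print(k, gamma_I(k, lam, sigma2, delta), gamma_I_closed_form(k, lam, sigma2, delta))

# Successive ratios for k >= 1 all equal exp(lam * delta); the ratio of the
# lag-1 to the lag-0 autocovariance does not follow this pattern.
print(gamma_I(2, lam, sigma2, delta) / gamma_I(1, lam, sigma2, delta),
      gamma_I(1, lam, sigma2, delta) / gamma_I(0, lam, sigma2, delta))
```

Geometric decay from lag 1 onward, with a lag-0 value that breaks the pattern, is the autocovariance signature of an ARMA(1, 1) process referred to in the example.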


Remark 2. Since equations (7.5.16)-(7.5.19) (derived by Barndorff-Nielsen and
Shephard 2001) apply to any second-order stationary non-negative stochastic volatility
process, $h$, independent of $B$ in (7.5.13), they can be used to calculate the second order
properties of $\{Y_n\}$ and $\{I_n\}$ for more general models than the Ornstein-Uhlenbeck
model defined by (7.5.13) and (7.5.15). If $m = b = 0$ the second-order properties
of $\{Y_n^2\}$ can also be calculated using equations (7.5.20) and (7.5.21). In particular we
can replace the Ornstein-Uhlenbeck process, $h$, in Example 7.5.4 by a non-negative
CARMA process (see Section 11.5) to allow a more general class of autocovariance
functions for the sequences $\{I_n\}$ and $\{Y_n^2\}$ in order to better represent empirically
observed  financial  data. 


Remark  3.  Continuous-time  generalizations  of  the  GARCH  process  have  also  been 
developed (see Klüppelberg et al. (2004) and Brockwell et al. (2006)). Details, however,
are  beyond  the  scope  of  this  book. 


7.6  An  Introduction  to  Option  Pricing 

We saw in Section 7.5.2 that, under the geometric Brownian motion model, the asset
price $P(t)$ satisfies the Itô equation,

$$dP(t) = P(t)\,[\mu\,dt + \sigma\,dB(t)] \quad \text{with } P(0) > 0, \tag{7.6.1}$$

which leads to the relation,

$$P(t) = P(0)\exp\left[(\mu - \sigma^2/2)t + \sigma B(t)\right]. \tag{7.6.2}$$

In this section we shall determine the value of a European call option on an asset
whose price satisfies (7.6.2). The result, derived by Black and Scholes (1973) and
Merton (1973), clearly demonstrates the key role played by the volatility parameter $\sigma^2$.

A  European  call  option,  if  sold  at  time  0,  gives  the  buyer  the  right,  but  not  the 
obligation,  to  buy  one  unit  of  the  stock  at  the  strike  time  T  for  the  strike  price  K.  At 
time $T$ the option has the cash value $h(P(T)) = \max(P(T) - K, 0)$ since the option
will be exercised only if $P(T) > K$, in which case the holder of the option can buy the
stock at the price $K$ and resell it instantly for $P(T)$. However it is not clear at time 0,
since  P(T)  is  random,  what  price  the  buyer  should  pay  for  this  privilege.  Assuming 

(i)  the  existence  of  a  risk-free  asset  with  price  process, 

$$D(t) = D(0)\exp(rt), \quad r > 0, \tag{7.6.3}$$

(ii)  the  ability  to  buy  and  sell  arbitrary  (positive  or  negative)  amounts  of  the  stock  and 
the  risk-free  asset  continuously  with  no  transaction  costs,  and 

(iii)  an  arbitrage-free  market  ( i.e.,  a  market  in  which  it  is  impossible  to  make  a  profit 
which  is  non-negative  with  probability  one  and  strictly  positive  with  probability 
greater than zero),

Black,  Scholes  and  Merton  showed  that  there  is  a  unique  value  for  the  option  in  the 
sense  that  both  higher  and  lower  prices  introduce  demonstrable  arbitrage  opportunities. 
Details  of  the  derivation  can  be  found  in  most  books  dealing  with  mathematical 
finance  (e.g.,  Campbell  et  al.  1996;  Mikosch  1998;  Klebaner  2005).  In  the  following 
paragraphs  we  give  a  sketch  of  two  arguments,  following  Mikosch  (1998),  which 
determine  this  value  under  the  assumption  that  the  asset  price  follows  the  GBM  model. 

In the first argument, we attempt to construct a self-financing portfolio, consisting
at time $t$ of $a_t$ shares of the stock and $b_t$ shares of the risk-free asset, where $a_t$ and $b_t$
are random variables which, for each $t$, are functions of $\{B(s),\ s \le t\}$. We require the
value of this portfolio at time $t$, namely

$$V(t) = a_t P(t) + b_t D(t), \tag{7.6.4}$$

to satisfy the self-financing condition,

$$dV(t) = a_t\,dP(t) + b_t\,dD(t), \tag{7.6.5}$$

and to match the value of the option at time $T$, i.e.,

$$V(T) = h(P(T)) = \max(P(T) - K, 0). \tag{7.6.6}$$

If such an investment strategy, $\{(a_t, b_t),\ 0 \le t \le T\}$, can be found, then $V(0)$ must
be the value of the option at the purchase time $t = 0$. A higher price for the option
would allow the seller to pocket the difference $\delta$ and invest the amount $V(0)$ in such
a way as to match the value of the option at time $T$. Then at time $T$, if $P(T) \le K$ the
option will not be exercised and the portfolio and the option will both have value zero.
If $P(T) > K$ the seller sells the portfolio for $P(T) - K$, then buys one stock for $P(T)$
and receives $K$ for it from the holder of the option. Since there is no loss involved
in this transaction, the seller is left with a net profit of $\delta$. The seller of the option
therefore makes a profit which is certainly non-negative and strictly positive with non-zero
probability, in violation of the no arbitrage assumption. Similarly a lower price
than $V(0)$ would create an arbitrage opportunity for the buyer. In order to determine
$V(t)$, $a_t$ and $b_t$ we look for a smooth function $v(t, x)$, $t \in [0, T]$, $x > 0$, such that

$$V(t) = v(t, P(t)), \quad t \in [0, T], \tag{7.6.7}$$

satisfies the conditions (7.6.4)-(7.6.6).

Writing $x$ for $P(t)$ in $v(t, P(t))$ and applying Itô's formula (see Appendix D) gives

$$dv = \frac{\partial v}{\partial t}\,dt + \frac{\partial v}{\partial x}\,dx + \frac{1}{2}\frac{\partial^2 v}{\partial x^2}\,(dx)^2, \tag{7.6.8}$$

where, from (7.6.1),

$$dx = x\,(\mu\,dt + \sigma\,dB(t)) \tag{7.6.9}$$

and

$$(dx)^2 = x^2\sigma^2\,dt. \tag{7.6.10}$$

Substituting (7.6.1) and (7.6.3) into the self-financing condition (7.6.5), and using (7.6.4)
to write $b_t D(t) = v - a_t x$, gives

$$dv = a_t x\,(\mu\,dt + \sigma\,dB(t)) + r(v - a_t x)\,dt. \tag{7.6.11}$$

Substituting (7.6.9) and (7.6.10) into (7.6.8) and comparing with (7.6.11), we find that

$$a_t = \frac{\partial v}{\partial x}(t, P(t)) \tag{7.6.12}$$

and that $v(t, x)$ satisfies the equation,

$$\frac{\partial v}{\partial t} + \frac{1}{2}\sigma^2 x^2\frac{\partial^2 v}{\partial x^2} + rx\frac{\partial v}{\partial x} = rv. \tag{7.6.13}$$

The condition (7.6.6) yields the boundary condition,

$$v(T, x) = h(x) = \max(x - K, 0), \tag{7.6.14}$$

which, with (7.6.13), uniquely determines the function $v$ and hence $V(t)$, $a_t$ and
$b_t = (V(t) - a_t P(t))/D(t)$ for each $t \in [0, T]$. The corresponding investment strategy
$\{(a_t, b_t),\ 0 \le t \le T\}$ satisfies (7.6.5) and (7.6.6) and can, under the assumed idealized
trading conditions, be implemented in practice. Since at time $T$ this portfolio has the
same value as the option, $V(0)$ must be the fair value of the option at time $t = 0$,
otherwise an arbitrage opportunity would arise. The option is said to be hedged by the
investment strategy $\{(a_t, b_t)\}$. A key feature of this solution [apparent from (7.6.12)-
(7.6.14)] is that both the strategy and the fair price of the option are independent of $\mu$,
depending on the price process $P$ only through the volatility parameter $\sigma^2$.
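The independence of $\mu$ can be illustrated by simulating the hedging strategy with rebalancing at a large but finite number of times. The Python sketch below (standard library only) takes for granted the closed-form value (7.6.23) obtained later in this section, together with the standard fact that its derivative with respect to $x$ is $\Phi(z_1)$. The function names, the parameter values and the number of rebalancing times are illustrative assumptions of ours, and with discrete rebalancing the match with the payoff is only approximate.

```python
import math
import random
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal distribution function


def bs_call(x, strike, r, sigma, tau):
    """Value of the call with time to maturity tau > 0, as in (7.6.23)."""
    z1 = (math.log(x / strike) + (r + 0.5 * sigma ** 2) * tau) / (sigma * math.sqrt(tau))
    z2 = z1 - sigma * math.sqrt(tau)
    return x * PHI(z1) - strike * math.exp(-r * tau) * PHI(z2)


def bs_delta(x, strike, r, sigma, tau):
    """Number of shares a_t = dv/dx, which equals Phi(z1) for the call."""
    z1 = (math.log(x / strike) + (r + 0.5 * sigma ** 2) * tau) / (sigma * math.sqrt(tau))
    return PHI(z1)


def hedge_once(p0, strike, mu, r, sigma, maturity, n_steps, seed=None):
    """Return (terminal portfolio value, option payoff) for one simulated path."""
    rng = random.Random(seed)
    dt = maturity / n_steps
    price = p0
    value = bs_call(p0, strike, r, sigma, maturity)  # start with V(0)
    for k in range(n_steps):
        tau = maturity - k * dt
        a = bs_delta(price, strike, r, sigma, tau)
        cash = value - a * price  # amount b_t * D(t) held in the risk-free asset
        # the stock moves under the real-world drift mu, as in (7.6.2)
        price *= math.exp((mu - 0.5 * sigma ** 2) * dt
                          + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0))
        value = a * price + cash * math.exp(r * dt)  # self-financing update
    return value, max(price - strike, 0.0)


for mu in (-0.1, 0.05, 0.3):
    print(mu, hedge_once(100.0, 100.0, mu, 0.05, 0.2, 1.0, 5000, seed=3))
```

In each case the initial capital is the same value $V(0)$, and the terminal portfolio value should be close to the payoff whatever value of $\mu$ drives the simulated price.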

Instead of attempting to solve (7.6.13) directly we now outline the martingale
argument which leads to the explicit solution for $v(t, x)$, $a_t$ and $b_t$. It is based on the fact
that for the GBM model with $B(t)$ defined on the probability space $(\Omega, \mathcal{F}, \Pi)$, there
is a unique probability measure $Q$ on $(\Omega, \mathcal{F})$ which is equivalent to $\Pi$ (i.e., it has the
same null sets) and which, when substituted for $\Pi$, causes the discounted price process
$\tilde P(t) := e^{-rt}P(t)$, $0 \le t \le T$, to be a $B$-martingale, i.e., to satisfy the conditions that
$E_Q\tilde P(t) < \infty$ and

$$E_Q(\tilde P(t) \mid B(u),\ u \le s) = \tilde P(s) \quad \text{for all } 0 \le s \le t \le T. \tag{7.6.15}$$

The measure $Q$ and the relation (7.6.15) can be derived as follows. Applying Itô's
formula to the expression $\tilde P(t) = e^{-rt}P(t)$ and using (7.6.1) gives

$$\frac{d\tilde P(t)}{\tilde P(t)} = (\mu - r)\,dt + \sigma\,dB(t) = \sigma\,d\tilde B(t), \tag{7.6.16}$$

where $\tilde B(t) := (\mu - r)t/\sigma + B(t)$. The solution of (7.6.16) satisfies

$$\tilde P(t) = \tilde P(0)\,e^{\sigma\tilde B(t) - \sigma^2 t/2}. \tag{7.6.17}$$

By Girsanov's theorem (see Mikosch 1998), if we define $Q$ by

$$Q(A) = \int_A \exp\left(-\frac{\mu - r}{\sigma}B(T) - \frac{(\mu - r)^2}{2\sigma^2}T\right)d\Pi, \tag{7.6.18}$$

then, on the new probability space $(\Omega, \mathcal{F}, Q)$, $\tilde B$ is standard Brownian motion. A
simple calculation using (7.6.17) then shows that the discounted price process $\tilde P$ is
a $B$-martingale on $(\Omega, \mathcal{F}, Q)$, i.e., $E_Q\tilde P(t) < \infty$ and (7.6.15) holds.
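A special case of (7.6.15), namely $E_Q\tilde P(T) = \tilde P(0)$, can be checked numerically. Under $\Pi$ the variable $B(T)$ is $\mathrm{N}(0, T)$, so expectations under $Q$ are obtained by weighting with the density in (7.6.18). The Python sketch below (standard library only) approximates both expectations by the midpoint rule; the function name, the integration range and the parameter values are illustrative assumptions of ours.

```python
import math
from statistics import NormalDist


def discounted_price_expectations(p0, mu, r, sigma, horizon, n=20000, width=12.0):
    """Return (E_Pi[P~(T)], E_Q[P~(T)]) by midpoint integration over B(T).

    P~(T) = p0 * exp((mu - r - sigma**2 / 2) * T + sigma * B(T)) and, by
    (7.6.18), dQ/dPi = exp(-m * B(T) - m**2 * T / 2) with m = (mu - r) / sigma.
    """
    m = (mu - r) / sigma
    law_of_b = NormalDist(0.0, math.sqrt(horizon))  # law of B(T) under Pi
    half = width * math.sqrt(horizon)               # integrate over +/- 12 s.d.
    db = 2.0 * half / n
    e_pi = 0.0
    e_q = 0.0
    for i in range(n):
        b = -half + (i + 0.5) * db
        discounted = p0 * math.exp((mu - r - 0.5 * sigma ** 2) * horizon + sigma * b)
        density_ratio = math.exp(-m * b - 0.5 * m ** 2 * horizon)
        weight = law_of_b.pdf(b) * db
        e_pi += discounted * weight
        e_q += discounted * density_ratio * weight
    return e_pi, e_q


e_pi, e_q = discounted_price_expectations(p0=100.0, mu=0.12, r=0.05, sigma=0.2,
                                          horizon=1.0)
print(e_pi, e_q)
```

Under $\Pi$ the expected discounted price grows like $e^{(\mu - r)T}$, whereas under $Q$ it stays at its initial value, as the martingale property requires.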

Assuming the existence of a portfolio (7.6.4) which satisfies the self-financing
condition (7.6.5) and the boundary condition (7.6.6), the discounted portfolio value is

$$\tilde V(t) = e^{-rt}V(t). \tag{7.6.19}$$

Applying Itô's formula to this expression we obtain

$$d\tilde V(t) = e^{-rt}(-rV(t)\,dt + dV(t)) = a_t e^{-rt}(-rP(t)\,dt + dP(t)) = a_t\,d\tilde P(t),$$

and hence, from (7.6.16),

$$\tilde V(t) = \tilde V(0) + \int_0^t a_s\,d\tilde P(s) = \tilde V(0) + \sigma\int_0^t a_s\tilde P(s)\,d\tilde B(s). \tag{7.6.20}$$

Since $a_t\tilde P(t)$ is a function of $\{B(s),\ s \le t\}$ for each $t \in [0, T]$ and since, under the
probability measure $Q$, $\tilde B$ is Brownian motion and $\tilde B(t)$ is a function of $\{B(s),\ s \le t\}$
for each $t \in [0, T]$, we conclude that $\tilde V$ is a $B$-martingale. Hence

$$\tilde V(t) = E_Q[\tilde V(T) \mid B(s),\ s \le t], \quad t \in [0, T],$$

and

$$V(t) = e^{rt}\tilde V(t) = E_Q[e^{-r(T-t)}h(P(T)) \mid B(s),\ s \le t], \tag{7.6.21}$$

where $h(P(T))$ is the value of the option at time $T$. For the European call option
$h(P(T)) = \max(P(T) - K, 0)$.

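It only remains to calculate $v(t, x)$ from (7.6.21). To do this we define $\theta := T - t$.
Then, expressing $P(T)$ in terms of $P(t)$,

$$V(t) = E_Q\left[e^{-r\theta}h\left(P(t)e^{(r - \sigma^2/2)\theta + \sigma(\tilde B(T) - \tilde B(t))}\right) \mid B(s),\ s \le t\right] = v(t, P(t)),$$

where

$$v(t, x) = e^{-r\theta}\int_{-\infty}^{\infty} h\left(xe^{(r - \sigma^2/2)\theta + \sigma y\theta^{1/2}}\right)\phi(y)\,dy \tag{7.6.22}$$

and $\phi$ is the standard normal density function,

$$\phi(y) = \frac{1}{\sqrt{2\pi}}\exp(-y^2/2).$$

Substituting $\max(x - K, 0)$ for $h(x)$ in (7.6.22) gives

$$v(t, x) = x\Phi(z_1) - Ke^{-r(T-t)}\Phi(z_2), \tag{7.6.23}$$

where $\Phi$ is the standard normal cumulative distribution function, $\Phi(x) = \int_{-\infty}^{x}\phi(u)\,du$,

$$z_1 = \frac{\log(x/K) + (r + \sigma^2/2)(T - t)}{\sigma\sqrt{T - t}} \quad \text{and} \quad z_2 = z_1 - \sigma\sqrt{T - t}.$$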


The value of the option at time 0 is $V(0) = v(0, P(0))$ and the investment strategy
$\{(a_t, b_t),\ 0 \le t \le T\}$ required to hedge it is determined by the relations $a_t = \frac{\partial v}{\partial x}(t, P(t))$
and $b_t = (v(t, P(t)) - a_t P(t))/D(t)$. It can be verified by direct substitution (Problem
7.12) that the function $v$ given by (7.6.23) satisfies the partial differential equation
(7.6.13) and the boundary condition (7.6.14).
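The closed form (7.6.23) and the integral (7.6.22) from which it is derived are both easy to evaluate. The Python sketch below (standard library only) computes the option value in both ways so that they can be compared. It uses the standard expressions $z_1 = [\log(x/K) + (r + \sigma^2/2)(T - t)]/(\sigma\sqrt{T - t})$ and $z_2 = z_1 - \sigma\sqrt{T - t}$; the function names, the midpoint-rule settings and the parameter values are illustrative assumptions of ours.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def call_value(t, x, strike, r, sigma, maturity):
    """v(t, x) from (7.6.23), for t < maturity."""
    theta = maturity - t
    z1 = ((math.log(x / strike) + (r + 0.5 * sigma ** 2) * theta)
          / (sigma * math.sqrt(theta)))
    z2 = z1 - sigma * math.sqrt(theta)
    return (x * _STD_NORMAL.cdf(z1)
            - strike * math.exp(-r * theta) * _STD_NORMAL.cdf(z2))


def call_value_by_integration(t, x, strike, r, sigma, maturity, n=20000, y_max=10.0):
    """Approximate the integral (7.6.22) by the midpoint rule on [-y_max, y_max]."""
    theta = maturity - t
    dy = 2.0 * y_max / n
    total = 0.0
    for i in range(n):
        y = -y_max + (i + 0.5) * dy
        terminal = x * math.exp((r - 0.5 * sigma ** 2) * theta
                                + sigma * y * math.sqrt(theta))
        total += max(terminal - strike, 0.0) * _STD_NORMAL.pdf(y) * dy
    return math.exp(-r * theta) * total


# One-year at-the-money call with r = 0.05 and sigma = 0.2 (assumed values).
print(call_value(0.0, 100.0, 100.0, 0.05, 0.2, 1.0))
print(call_value_by_integration(0.0, 100.0, 100.0, 0.05, 0.2, 1.0))
```

Neither function takes $\mu$ as an argument, reflecting the fact that the fair price depends on the price process only through $\sigma^2$.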

The quantity $m = (\mu - r)/\sigma$ which appears in the integrand in (7.6.18) is called
the market price of risk and represents the excess, in units of $\sigma$, of the instantaneous
rate of return $\mu$ of the risky asset $P$ over that of the risk-free asset $D$. If $m = 0$ then
$Q = \Pi$ and the model is said to be risk-neutral.

Although  the  model  (7.6.1)  has  many  shortcomings  as  a  representation  of  asset 
prices,  the  remarkable  achievement  of  Black,  Scholes  and  Merton  in  using  it  to  derive 
a  unique  arbitrage-free  option  price  has  inspired  enormous  interest  and  progress  in 
the  field  of  financial  mathematics.  As  a  result  of  their  pioneering  work,  research 
in  continuous-time  financial  models  has  blossomed,  with  much  of  it  directed  at 
the  construction,  estimation  and  analysis  of  more  realistic  continuous-time  models  for 
the  evolution  of  stock  prices,  and  the  pricing  of  options  based  on  such  models.  A  nice 
account  of  option-pricing  for  a  broad  class  of  Levy-driven  stock-price  models  can  be 
found  in  the  book  of  Schoutens  (2003). 


Problems 


7.1 Evaluate $E Z_t^4$ for the ARCH(1) process (7.2.5) with $0 < \alpha_1 < 1$ and $\{e_t\} \sim$
IID N(0, 1). Deduce that $E Z_t^4 < \infty$ if and only if $3\alpha_1^2 < 1$.

7.2 Let $\{Z_t\}$ be a causal stationary solution of the ARCH($p$) equations (7.2.1) and
(7.2.2) with $E Z_t^4 < \infty$. Assuming that such a process exists, show that $Y_t = Z_t^2/\alpha_0$
satisfies the equations

$$Y_t = e_t^2\left(1 + \sum_{i=1}^{p}\alpha_i Y_{t-i}\right)$$

and deduce that $\{Y_t\}$ has the same autocorrelation function as the AR($p$) process

$$W_t = \sum_{i=1}^{p}\alpha_i W_{t-i} + e_t, \quad \{e_t\} \sim \mathrm{WN}(0, 1).$$

(In the case $p = 1$, a necessary and sufficient condition for existence of a causal
stationary solution of (7.2.1) and (7.2.2) with $E Z_t^4 < \infty$ is $3\alpha_1^2 < 1$, as shown
by the results of Section 7.2 and Problem 7.1.)
7.3 Suppose that $\{Z_t\}$ is a causal stationary GARCH($p, q$) process $Z_t = \sqrt{h_t}\,e_t$, where
$\{e_t\} \sim \mathrm{IID}(0, 1)$, $\sum_{i=1}^{p}\alpha_i + \sum_{j=1}^{q}\beta_j < 1$ and

$$h_t = \alpha_0 + \alpha_1 Z_{t-1}^2 + \cdots + \alpha_p Z_{t-p}^2 + \beta_1 h_{t-1} + \cdots + \beta_q h_{t-q}.$$

a. Show that $E(Z_t^2 \mid Z_{t-1}^2, Z_{t-2}^2, \ldots) = h_t$.

b. Show that the squared process $\{Z_t^2\}$ is an ARMA($m, q$) process satisfying the
equations

$$Z_t^2 = \alpha_0 + (\alpha_1 + \beta_1)Z_{t-1}^2 + \cdots + (\alpha_m + \beta_m)Z_{t-m}^2 + U_t - \beta_1 U_{t-1} - \cdots - \beta_q U_{t-q},$$

where $m = \max\{p, q\}$, $\alpha_j = 0$ for $j > p$, $\beta_j = 0$ for $j > q$, and $U_t = Z_t^2 - h_t$
is white noise if $E Z_t^4 < \infty$.

c. For $p \ge 1$, show that the conditional variance process $\{h_t\}$ is an
ARMA($m, p - 1$) process satisfying the equations

$$h_t = \alpha_0 + (\alpha_1 + \beta_1)h_{t-1} + \cdots + (\alpha_m + \beta_m)h_{t-m} + V_t + \alpha_1^* V_{t-1} + \cdots + \alpha_{p-1}^* V_{t-p+1},$$

where $V_t = \alpha_1 U_{t-1}$ and $\alpha_j^* = \alpha_{j+1}/\alpha_1$ for $j = 1, \ldots, p - 1$.

7.4  To each of the seven components of the multivariate time series filed as
STOCK7.TSM, fit an ARMA model driven by GARCH noise. Compare
the fitted models for the various series and comment on the differences.
(For exporting components of a multivariate time series to a univariate project,
see the topic Getting started in the PDF file ITSM_HELP, which is included in
the ITSM software package.)

7.5  Verify  equation  (7.3.7). 

7.6  Show that the return, $Z_\Delta(t) := \log P(t) - \log P(t - \Delta)$, approximates the fractional gain, $F_\Delta(t) := (P(t) - P(t - \Delta))/P(t - \Delta)$, in the sense that

$$\frac{Z_\Delta(t)}{F_\Delta(t)} \to 1 \quad \text{as } F_\Delta(t) \to 0.$$

7.7  For the GBM model (7.5.7) with $P(0) = 1$, evaluate the mean and variance of $P(t)$ and the mean and variance of the return, $Z_\Delta(t)$.

7.8  If $h$ is any second-order stationary non-negative volatility process with mean $\xi$, variance $\omega^2$ and autocorrelation function $\rho$, verify the relations (7.5.16)-(7.5.18).

7.9  Use  (7.5.26)  and  (7.5.27)  to  evaluate  the  mean  and  autocovariance  function  of 
the  stationary  Ornstein-Uhlenbeck  process  (7.5.23). 

7.10  If $h$ is the stationary Ornstein-Uhlenbeck process (7.5.23) and $s$ is any fixed value in $[0, \Delta]$, show that application of the operator $\phi(B) := (1 - e^{\lambda\Delta}B)$ to the sequence $\{h(n\Delta + s), n \in \mathbb{Z}\}$ gives

$$\phi(B)h(n\Delta + s) = W_n(s),$$

where $\{W_n(s), n \in \mathbb{Z}\}$ is the iid sequence

$$W_n(s) = \int_{(n-1)\Delta + s}^{n\Delta + s} e^{\lambda(n\Delta + s - u)}\, dL(u).$$

Deduce that the integrated volatility sequence, $I_n = \int_{-\Delta}^{0} h(n\Delta + s)\,ds$, satisfies

$$(1 - e^{\lambda\Delta}B) I_n = \int_{-\Delta}^{0} W_n(s)\,ds.$$

Since the right-hand side is 1-correlated, it follows from Proposition 2.1.1 that
it is an MA(1) process and hence that the integrated volatility sequence is an
ARMA(1,1) process.

7.11  For  the  stochastic  volatility  model  (7.5.13)  with  m  =  b  =  0  and  second-order 
stationary  volatility  process  h  independent  of  W ,  establish  (7.5.20)  and  (7.5.21). 

7.12  Verify  that  the  expression  (7.6.23)  for  v(t,  s )  satisfies  (7.6.13)  and  (7.6.14)  and 
use  it  to  write  down  the  value  of  the  option  at  time  t  =  0  and  the  corresponding 
investment strategy $\{(a_t, b_t),\ 0 \le t \le T\}$.


Multivariate  Time  Series 


8.1  Examples 

8.2  Second-Order  Properties  of  Multivariate  Time  Series 

8.3  Estimation  of  the  Mean  and  Covariance  Function 

8.4  Multivariate  ARMA  Processes 

8.5  Best  Linear  Predictors  of  Second-Order  Random  Vectors 

8.6  Modeling  and  Forecasting  with  Multivariate  AR  Processes 

8.7  Cointegration 


Many time series arising in practice are best considered as components of some vector-valued (multivariate) time series $\{\mathbf{X}_t\}$ having not only serial dependence within each component series $\{X_{ti}\}$ but also interdependence between the different component series $\{X_{ti}\}$ and $\{X_{tj}\}$, $i \ne j$. Much of the theory of univariate time series extends in a natural way to the multivariate case; however, new problems arise. In this chapter we introduce the basic properties of multivariate series and consider the multivariate extensions of some of the techniques developed earlier. In Section 8.1 we introduce two sets of bivariate time series data for which we develop multivariate models later in the chapter. In Section 8.2 we discuss the basic properties of stationary multivariate time series, namely, the mean vector $\boldsymbol{\mu} = E\mathbf{X}_t$ and the covariance matrices $\Gamma(h) = E(\mathbf{X}_{t+h}\mathbf{X}_t') - \boldsymbol{\mu}\boldsymbol{\mu}'$, $h = 0, \pm 1, \pm 2, \ldots$, with reference to some simple examples, including multivariate white noise. Section 8.3 deals with estimation of $\boldsymbol{\mu}$ and $\Gamma(\cdot)$ and the question of testing for serial independence on the basis of observations of $\mathbf{X}_1, \ldots, \mathbf{X}_n$. In Section 8.4 we introduce multivariate ARMA processes and illustrate the problem of multivariate model identification with an example of a multivariate AR(1) process that also has an MA(1) representation. (Such examples do not exist in the univariate case.) The identification problem can be avoided by confining attention to multivariate autoregressive (or VAR) models. Forecasting multivariate time series with known second-order properties is discussed in Section 8.5, and in Section 8.6 we consider the modeling and forecasting of multivariate time series using the multivariate Yule-Walker equations and Whittle's generalization of the Durbin-Levinson algorithm. Section 8.7 contains a brief introduction to the notion of cointegrated time series.


8.1  Examples 


In this section we introduce two examples of bivariate time series. A bivariate time series is a series of two-dimensional vectors $(X_{t1}, X_{t2})'$ observed at times $t$ (usually $t = 1, 2, 3, \ldots$). The two component series $\{X_{t1}\}$ and $\{X_{t2}\}$ could be studied independently as univariate time series, each characterized, from a second-order point of view, by its own mean and autocovariance function. Such an approach, however, fails to take into account possible dependence between the two component series, and such cross-dependence may be of great importance, for example in predicting future values of the two component series.

We therefore consider the series of random vectors $\mathbf{X}_t = (X_{t1}, X_{t2})'$ and define the mean vector

$$\boldsymbol{\mu}_t := E\mathbf{X}_t = \begin{bmatrix} EX_{t1} \\ EX_{t2} \end{bmatrix}$$

and covariance matrices

$$\Gamma(t+h, t) := \mathrm{Cov}(\mathbf{X}_{t+h}, \mathbf{X}_t) = \begin{bmatrix} \mathrm{cov}(X_{t+h,1}, X_{t1}) & \mathrm{cov}(X_{t+h,1}, X_{t2}) \\ \mathrm{cov}(X_{t+h,2}, X_{t1}) & \mathrm{cov}(X_{t+h,2}, X_{t2}) \end{bmatrix}.$$

The bivariate series $\{\mathbf{X}_t\}$ is said to be (weakly) stationary if the moments $\boldsymbol{\mu}_t$ and $\Gamma(t+h, t)$ are both independent of $t$, in which case we use the notation

$$\boldsymbol{\mu} = E\mathbf{X}_t = \begin{bmatrix} EX_{t1} \\ EX_{t2} \end{bmatrix}$$

and

$$\Gamma(h) = \mathrm{Cov}(\mathbf{X}_{t+h}, \mathbf{X}_t) = \begin{bmatrix} \gamma_{11}(h) & \gamma_{12}(h) \\ \gamma_{21}(h) & \gamma_{22}(h) \end{bmatrix}.$$

The diagonal elements are the autocovariance functions of the univariate series $\{X_{t1}\}$ and $\{X_{t2}\}$ as defined in Chapter 2, while the off-diagonal elements are the covariances between $X_{t+h,i}$ and $X_{tj}$, $i \ne j$. Notice that $\gamma_{12}(h) = \gamma_{21}(-h)$.

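A natural estimator of the mean vector $\boldsymbol{\mu}$ in terms of the observations $\mathbf{X}_1, \ldots, \mathbf{X}_n$ is the vector of sample means

$$\bar{\mathbf{X}}_n = \frac{1}{n} \sum_{t=1}^{n} \mathbf{X}_t,$$

and a natural estimator of $\Gamma(h)$ is

$$\hat\Gamma(h) = \begin{cases} n^{-1} \displaystyle\sum_{t=1}^{n-h} (\mathbf{X}_{t+h} - \bar{\mathbf{X}}_n)(\mathbf{X}_t - \bar{\mathbf{X}}_n)' & \text{for } 0 \le h \le n-1, \\ \hat\Gamma(-h)' & \text{for } -n+1 \le h < 0. \end{cases}$$

The correlation $\rho_{ij}(h)$ between $X_{t+h,i}$ and $X_{tj}$ is estimated by

$$\hat\rho_{ij}(h) = \hat\gamma_{ij}(h)\,(\hat\gamma_{ii}(0)\hat\gamma_{jj}(0))^{-1/2}.$$

If $i = j$, then $\hat\rho_{ij}$ reduces to the sample autocorrelation function of the $i$th series.
These estimators will be discussed in more detail in Section 8.3.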
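The estimators of the mean vector, the lag-$h$ covariance matrix and the cross-correlations can be sketched in a few lines of standard-library Python. This is only an illustration of the formulas in this section: the function names `sample_mean`, `gamma_hat` and `rho_hat` are hypothetical, and the toy data are made up for the example (they are not one of the book's data sets).

```python
from math import sqrt

def sample_mean(x):
    """Componentwise mean of a list of m-dimensional observations."""
    n, m = len(x), len(x[0])
    return [sum(obs[i] for obs in x) / n for i in range(m)]

def gamma_hat(x, h):
    """Sample covariance matrix at lag h, using the divisor n as in the text."""
    n, m = len(x), len(x[0])
    if h < 0:
        # For negative lags use the transpose of the matrix at lag -h.
        g = gamma_hat(x, -h)
        return [[g[j][i] for j in range(m)] for i in range(m)]
    xbar = sample_mean(x)
    return [[sum((x[t + h][i] - xbar[i]) * (x[t][j] - xbar[j])
                 for t in range(n - h)) / n
             for j in range(m)]
            for i in range(m)]

def rho_hat(x, h):
    """Sample correlation matrix at lag h: gamma_ij(h) / sqrt(gamma_ii(0) gamma_jj(0))."""
    g0, gh = gamma_hat(x, 0), gamma_hat(x, h)
    m = len(g0)
    return [[gh[i][j] / sqrt(g0[i][i] * g0[j][j]) for j in range(m)]
            for i in range(m)]

# Toy bivariate data in which the second component roughly follows the first
# with a delay of one time unit (illustrative values only).
toy = [(1.0, 0.0), (3.0, 1.0), (2.0, 3.0), (5.0, 2.0), (4.0, 5.0), (6.0, 4.0)]
toy_mean = sample_mean(toy)
toy_rho_lag1 = rho_hat(toy, 1)
```

With data of this kind, the entry `rho_hat(toy, 1)[1][0]`, which estimates the correlation between the second component at time $t+1$ and the first component at time $t$, is the one that picks up the delayed dependence.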
Figure 8-1

The Dow Jones Index (top) and Australian All Ordinaries Index (bottom) at closing on 251 trading days ending August 26th, 1994

Example 8.1.1


Dow Jones and All Ordinaries Indices; DJAO2.TSM

Figure 8-1 shows the closing values $D_0, \ldots, D_{250}$ of the Dow Jones Index of stocks on the New York Stock Exchange and the closing values $A_0, \ldots, A_{250}$ of the Australian All Ordinaries Index of Share Prices, recorded at the termination of trading on 251 successive trading days up to August 26th, 1994. (Because of the time difference between Sydney and New York, the markets do not close simultaneously in both places; however, in Sydney the closing price of the Dow Jones index for the previous day is known before the opening of the market on any trading day.) The efficient market hypothesis suggests that these processes should resemble random walks with uncorrelated increments. In order to model the data as a stationary bivariate time series we first reexpress them as percentage relative price changes or percentage returns (filed as DJAOPC2.TSM)

$$X_{t1} = 100\,\frac{D_t - D_{t-1}}{D_{t-1}}, \qquad t = 1, \ldots, 250,$$

and

$$X_{t2} = 100\,\frac{A_t - A_{t-1}}{A_{t-1}}, \qquad t = 1, \ldots, 250.$$
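The transformation from closing values to percentage returns is a one-line computation. The sketch below is a minimal illustration in Python; the function name `percent_returns` and the three prices are hypothetical and are not taken from DJAO2.TSM.

```python
def percent_returns(prices):
    """Return 100 * (P_t - P_{t-1}) / P_{t-1} for t = 1, ..., n,
    given the n + 1 closing values P_0, ..., P_n."""
    return [100.0 * (cur - prev) / prev
            for prev, cur in zip(prices, prices[1:])]

# Illustrative closing values only.
example_returns = percent_returns([100.0, 110.0, 99.0])
```

A series of 251 closing values produces 250 returns, which matches the index range $t = 1, \ldots, 250$ used in the example.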

The estimators $\hat\rho_{11}(h)$ and $\hat\rho_{22}(h)$ of the autocorrelations of the two univariate series are shown in Figures 8-2 and 8-3. They are not significantly different from zero.

To compute the sample cross-correlations $\hat\rho_{12}(h)$ and $\hat\rho_{21}(h)$ using ITSM, select File>Project>Open>Multivariate. Then click OK and double-click on the file name DJAOPC2.TSM. You will see a dialog box in which Number of columns should be set to 2 (the number of components of the observation vectors). Then click OK, and the graphs of the two component series will appear. To see the correlations, press the middle yellow button at the top of the ITSM window. The correlation functions are plotted as a 2 x 2 array of graphs with $\hat\rho_{11}(h)$, $\hat\rho_{12}(h)$ in the top row and $\hat\rho_{21}(h)$, $\hat\rho_{22}(h)$ in the second row. We see from these graphs (shown in Figure 8-4) that although the autocorrelations $\hat\rho_{ii}(h)$, $i = 1, 2$, are all small, there is a much larger correlation between $X_{t-1,1}$ and $X_{t,2}$. This indicates the importance of considering the two series jointly as components of a bivariate time series. It also suggests that the value of $X_{t-1,1}$, i.e., the Dow Jones return on day $t - 1$, may be of assistance in predicting the value of $X_{t,2}$, the All Ordinaries return on day $t$. This last observation is supported by the scatterplot of the points $(x_{t-1,1}, x_{t,2})$, $t = 2, \ldots, 250$, shown in Figure 8-5.

□

Figure 8-2

The sample ACF $\hat\rho_{11}$ of the observed values of $\{X_{t1}\}$ in Example 8.1.1, showing the bounds $\pm 1.96n^{-1/2}$

Figure 8-3

The sample ACF $\hat\rho_{22}$ of the observed values of $\{X_{t2}\}$ in Example 8.1.1, showing the bounds $\pm 1.96n^{-1/2}$


Example 8.1.2


Sales with a leading indicator; LS2.TSM

In this example we consider the sales data $\{Y_{t2}, t = 1, \ldots, 150\}$ with leading indicator $\{Y_{t1}, t = 1, \ldots, 150\}$ given by Box and Jenkins (1976, p. 537). The two series are stored in the ITSM data files SALES.TSM and LEAD.TSM, respectively, and in bivariate format as LS2.TSM. The graphs of the two series and their sample autocorrelation functions strongly suggest that both series are nonstationary. Application of the operator $(1 - B)$ yields the two differenced series $\{D_{t1}\}$ and $\{D_{t2}\}$, whose properties are compatible with those of low-order ARMA processes. Using ITSM, we find that the models

$$D_{t1} - 0.0228 = Z_{t1} - 0.474 Z_{t-1,1}, \qquad \{Z_{t1}\} \sim \mathrm{WN}(0, 0.0779), \tag{8.1.1}$$

$$D_{t2} - 0.838 D_{t-1,2} - 0.0676 = Z_{t2} - 0.610 Z_{t-1,2}, \qquad \{Z_{t2}\} \sim \mathrm{WN}(0, 1.754), \tag{8.1.2}$$

provide good fits to the series $\{D_{t1}\}$ and $\{D_{t2}\}$.

Figure 8-4

The sample correlations $\hat\rho_{ij}(h)$ between $X_{t+h,i}$ and $X_{t,j}$ for Example 8.1.1. ($\hat\rho_{ij}(h)$ is plotted as the $j$th graph in the $i$th row, $i, j = 1, 2$. Series 1 and 2 consist of the daily Dow Jones and All Ordinaries percentage returns, respectively.)

Figure 8-5

Scatterplot of $(x_{t-1,1}, x_{t,2})$, $t = 2, \ldots, 250$, for the data in Example 8.1.1

The sample autocorrelations and cross-correlations of $\{D_{t1}\}$ and $\{D_{t2}\}$ are computed by opening the bivariate ITSM file LS2.TSM (as described in Example 8.1.1). The option Transform>Difference, with differencing lag equal to 1, generates the bivariate differenced series $\{(D_{t1}, D_{t2})'\}$, and the correlation functions are then obtained as in Example 8.1.1 by clicking on the middle yellow button at the top of the ITSM screen. The sample auto- and cross-correlations $\hat\rho_{ij}(h)$, $i, j = 1, 2$, are shown in Figure 8-6. As we shall see in Section 8.3, care must be taken in interpreting the cross-correlations without first taking into account the autocorrelations of $\{D_{t1}\}$ and $\{D_{t2}\}$.

Figure 8-6

The sample correlations $\hat\rho_{ij}(h)$ of the series $\{D_{t1}\}$ and $\{D_{t2}\}$ of Example 8.1.2, showing the bounds $\pm 1.96n^{-1/2}$. ($\hat\rho_{ij}(h)$ is plotted as the $j$th graph in the $i$th row, $i, j = 1, 2$.)

□
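For readers working outside ITSM, the differencing step of Example 8.1.2, i.e., applying $(1 - B)$ to each component of a bivariate series, can be sketched as follows. The function name `difference` and the sample values are hypothetical; this is an illustration of the operation, not of the ITSM implementation.

```python
def difference(series, lag=1):
    """Apply (1 - B^lag) componentwise to a list of observation vectors."""
    return [tuple(c - p for c, p in zip(cur, prev))
            for prev, cur in zip(series, series[lag:])]

# Illustrative bivariate observations (not the LS2.TSM data).
levels = [(1.0, 10.0), (3.0, 9.0), (6.0, 7.0)]
differenced = difference(levels)
```

A series of length $n$ yields $n - 1$ differenced vectors when the differencing lag is 1.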


8.2  Second-Order  Properties  of  Multivariate  Time  Series 


Consider $m$ time series $\{X_{ti}, t = 0, \pm 1, \ldots\}$, $i = 1, \ldots, m$, with $EX_{ti}^2 < \infty$ for all $t$ and $i$. If all the finite-dimensional distributions of the random variables $\{X_{ti}\}$ were multivariate normal, then the distributional properties of $\{X_{ti}\}$ would be completely determined by the means

$$\mu_{ti} := EX_{ti} \tag{8.2.1}$$

and the covariances

$$\gamma_{ij}(t+h, t) := E[(X_{t+h,i} - \mu_{t+h,i})(X_{tj} - \mu_{tj})]. \tag{8.2.2}$$

Even when the observations $\{X_{ti}\}$ do not have joint normal distributions, the quantities $\mu_{ti}$ and $\gamma_{ij}(t+h, t)$ specify the second-order properties, the covariances providing us with a measure of the dependence, not only between observations in the same series, but also between the observations in different series.

It is more convenient in dealing with $m$ interrelated series to use vector notation. Thus we define

$$\mathbf{X}_t := \begin{bmatrix} X_{t1} \\ \vdots \\ X_{tm} \end{bmatrix}, \qquad t = 0, \pm 1, \ldots. \tag{8.2.3}$$

The second-order properties of the multivariate time series $\{\mathbf{X}_t\}$ are then specified by the mean vectors

$$\boldsymbol{\mu}_t := E\mathbf{X}_t = \begin{bmatrix} \mu_{t1} \\ \vdots \\ \mu_{tm} \end{bmatrix} \tag{8.2.4}$$

and covariance matrices

$$\Gamma(t+h, t) := \begin{bmatrix} \gamma_{11}(t+h, t) & \cdots & \gamma_{1m}(t+h, t) \\ \vdots & \ddots & \vdots \\ \gamma_{m1}(t+h, t) & \cdots & \gamma_{mm}(t+h, t) \end{bmatrix}, \tag{8.2.5}$$

where

$$\gamma_{ij}(t+h, t) := \mathrm{Cov}(X_{t+h,i}, X_{t,j}).$$

Remark 1. The matrix $\Gamma(t+h, t)$ can also be expressed as

$$\Gamma(t+h, t) := E[(\mathbf{X}_{t+h} - \boldsymbol{\mu}_{t+h})(\mathbf{X}_t - \boldsymbol{\mu}_t)'],$$

where as usual, the expected value of a random matrix $A$ is the matrix whose components are the expected values of the components of $A$. □

As in the univariate case, a particularly important role is played by the class of multivariate stationary time series, defined as follows.

Definition 8.2.1

The $m$-variate series $\{\mathbf{X}_t\}$ is (weakly) stationary if

(i)  $\boldsymbol{\mu}_X(t)$ is independent of $t$
and

(ii)  $\Gamma_X(t+h, t)$ is independent of $t$ for each $h$.


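For a stationary time series we shall use the notation

$$\boldsymbol{\mu} := E\mathbf{X}_t = \begin{bmatrix} \mu_1 \\ \vdots \\ \mu_m \end{bmatrix} \tag{8.2.6}$$

and

$$\Gamma(h) := E[(\mathbf{X}_{t+h} - \boldsymbol{\mu})(\mathbf{X}_t - \boldsymbol{\mu})'] = \begin{bmatrix} \gamma_{11}(h) & \cdots & \gamma_{1m}(h) \\ \vdots & \ddots & \vdots \\ \gamma_{m1}(h) & \cdots & \gamma_{mm}(h) \end{bmatrix}. \tag{8.2.7}$$

We shall refer to $\boldsymbol{\mu}$ as the mean of the series and to $\Gamma(h)$ as the covariance matrix at lag $h$. Notice that if $\{\mathbf{X}_t\}$ is stationary with covariance matrix function $\Gamma(\cdot)$, then for each $i$, $\{X_{ti}\}$ is stationary with covariance function $\gamma_{ii}(\cdot)$. The function $\gamma_{ij}(\cdot)$, $i \ne j$, is called the cross-covariance function of the two series $\{X_{ti}\}$ and $\{X_{tj}\}$. It should be noted that $\gamma_{ij}(\cdot)$ is not in general the same as $\gamma_{ji}(\cdot)$. The correlation matrix function $R(\cdot)$ is defined by

$$R(h) := \begin{bmatrix} \rho_{11}(h) & \cdots & \rho_{1m}(h) \\ \vdots & \ddots & \vdots \\ \rho_{m1}(h) & \cdots & \rho_{mm}(h) \end{bmatrix}, \tag{8.2.8}$$

where $\rho_{ij}(h) = \gamma_{ij}(h)/[\gamma_{ii}(0)\gamma_{jj}(0)]^{1/2}$. The function $R(\cdot)$ is the covariance matrix function of the normalized series obtained by subtracting $\boldsymbol{\mu}$ from $\mathbf{X}_t$ and then dividing each component by its standard deviation.

Example 8.2.1

Consider the bivariate stationary process $\{\mathbf{X}_t\}$ defined by

$$X_{t1} = Z_t,$$

$$X_{t2} = Z_t + 0.75 Z_{t-10},$$

where $\{Z_t\} \sim \mathrm{WN}(0, 1)$. Elementary calculations yield $\boldsymbol{\mu} = \mathbf{0}$,

$$\Gamma(-10) = \begin{bmatrix} 0 & 0.75 \\ 0 & 0.75 \end{bmatrix}, \quad \Gamma(0) = \begin{bmatrix} 1 & 1 \\ 1 & 1.5625 \end{bmatrix}, \quad \Gamma(10) = \begin{bmatrix} 0 & 0 \\ 0.75 & 0.75 \end{bmatrix},$$

and $\Gamma(j) = 0$ otherwise. The correlation matrix function is given by

$$R(-10) = \begin{bmatrix} 0 & 0.60 \\ 0 & 0.48 \end{bmatrix}, \quad R(0) = \begin{bmatrix} 1 & 0.8 \\ 0.8 & 1 \end{bmatrix}, \quad R(10) = \begin{bmatrix} 0 & 0 \\ 0.60 & 0.48 \end{bmatrix},$$

and $R(j) = 0$ otherwise.

□

Basic Properties of $\Gamma(\cdot)$:

1. $\Gamma(h) = \Gamma'(-h)$,

2. $|\gamma_{ij}(h)| \le [\gamma_{ii}(0)\gamma_{jj}(0)]^{1/2}$, $i, j = 1, \ldots, m$,

3. $\gamma_{ii}(\cdot)$ is an autocovariance function, $i = 1, \ldots, m$, and

4. $\sum_{j,k=1}^{n} \mathbf{a}_j' \Gamma(j - k) \mathbf{a}_k \ge 0$ for all $n \in \{1, 2, \ldots\}$ and $\mathbf{a}_1, \ldots, \mathbf{a}_n \in \mathbb{R}^m$.

Proof

The first property follows at once from the definition, the second from the fact that correlations cannot be greater than one in absolute value, and the third from the observation that $\gamma_{ii}(\cdot)$ is the autocovariance function of the stationary series $\{X_{ti}, t = 0, \pm 1, \ldots\}$. Property 4 is a statement of the obvious fact that

$$E\left(\sum_{j=1}^{n} \mathbf{a}_j'(\mathbf{X}_j - \boldsymbol{\mu})\right)^2 \ge 0.$$

Remark 2. The basic properties of the matrices $\Gamma(h)$ are shared also by the corresponding matrices of correlations $R(h) = [\rho_{ij}(h)]_{i,j=1}^{m}$, which have the additional property

$$\rho_{ii}(0) = 1 \quad \text{for all } i.$$

The correlation $\rho_{ij}(0)$ is the correlation between $X_{ti}$ and $X_{tj}$, which is generally not equal to 1 if $i \ne j$ (see Example 8.2.1). It is also possible that $|\gamma_{ij}(h)| > |\gamma_{ij}(0)|$ if $i \ne j$ (see Problem 7.1). □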
The simplest multivariate time series is multivariate white noise, the definition of which is quite analogous to that of univariate white noise.


Definition 8.2.2.


The $m$-variate series $\{\mathbf{Z}_t\}$ is called white noise with mean $\mathbf{0}$ and covariance matrix $\Sigma$, written

$$\{\mathbf{Z}_t\} \sim \mathrm{WN}(\mathbf{0}, \Sigma), \tag{8.2.9}$$

if $\{\mathbf{Z}_t\}$ is stationary with mean vector $\mathbf{0}$ and covariance matrix function

$$\Gamma(h) = \begin{cases} \Sigma & \text{if } h = 0, \\ 0 & \text{otherwise.} \end{cases} \tag{8.2.10}$$


Definition 8.2.3.


The $m$-variate series $\{\mathbf{Z}_t\}$ is called iid noise with mean $\mathbf{0}$ and covariance matrix $\Sigma$, written

$$\{\mathbf{Z}_t\} \sim \mathrm{iid}(\mathbf{0}, \Sigma), \tag{8.2.11}$$

if the random vectors $\{\mathbf{Z}_t\}$ are independent and identically distributed with mean $\mathbf{0}$ and covariance matrix $\Sigma$.
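To make Definition 8.2.3 concrete, the following sketch simulates bivariate iid noise with a prescribed covariance matrix. It assumes Gaussian noise, which the definition does not require, and it uses a hand-written Cholesky factor for the 2 x 2 case; the function names `cholesky_2x2` and `simulate_iid_noise` and the matrix `sigma_example` are hypothetical. Since iid noise with finite second moments is in particular white noise, the output is also a realization of a WN process in the sense of Definition 8.2.2.

```python
import random
from math import sqrt

def cholesky_2x2(sigma):
    """Lower-triangular L with L L' = sigma, for a positive definite 2 x 2 matrix."""
    l11 = sqrt(sigma[0][0])
    l21 = sigma[1][0] / l11
    l22 = sqrt(sigma[1][1] - l21 ** 2)
    return [[l11, 0.0], [l21, l22]]

def simulate_iid_noise(sigma, n, seed=0):
    """Draw n independent Gaussian vectors with mean 0 and covariance sigma.

    Each vector is L e, where e has independent standard normal components.
    """
    rng = random.Random(seed)
    L = cholesky_2x2(sigma)
    noise = []
    for _ in range(n):
        e1, e2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        noise.append((L[0][0] * e1, L[1][0] * e1 + L[1][1] * e2))
    return noise

# Illustrative covariance matrix with correlated components.
sigma_example = [[1.0, 0.5], [0.5, 2.0]]
z = simulate_iid_noise(sigma_example, n=500)
```

The off-diagonal entry of the covariance matrix controls the contemporaneous dependence between the two components; there is no dependence between vectors at different times.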


Multivariate white noise $\{\mathbf{Z}_t\}$ is used as a building block from which can be constructed an enormous variety of multivariate time series. The linear processes are generated as follows.


Definition 8.2.4.


The $m$-variate series $\{\mathbf{X}_t\}$ is a linear process if it has the representation

$$\mathbf{X}_t = \sum_{j=-\infty}^{\infty} C_j \mathbf{Z}_{t-j}, \qquad \{\mathbf{Z}_t\} \sim \mathrm{WN}(\mathbf{0}, \Sigma), \tag{8.2.12}$$

where $\{C_j\}$ is a sequence of $m \times m$ matrices whose components are absolutely summable.


The linear process (8.2.12) is stationary (Problem 7.2) with mean $\mathbf{0}$ and covariance function

$$\Gamma(h) = \sum_{j=-\infty}^{\infty} C_{j+h} \Sigma C_j', \qquad h = 0, \pm 1, \ldots. \tag{8.2.13}$$

An MA($\infty$) process is a linear process with $C_j = 0$ for $j < 0$. Thus $\{\mathbf{X}_t\}$ is an MA($\infty$) process if and only if there exists a white noise sequence $\{\mathbf{Z}_t\}$ and a sequence of matrices $C_j$ with absolutely summable components such that

$$\mathbf{X}_t = \sum_{j=0}^{\infty} C_j \mathbf{Z}_{t-j}.$$

Multivariate ARMA processes will be discussed in Section 8.4, where it will be shown in particular that any causal ARMA($p$, $q$) process can be expressed as an MA($\infty$) process, while any invertible ARMA($p$, $q$) process can be expressed as an AR($\infty$) process, i.e., a process satisfying equations of the form

$$\mathbf{X}_t + \sum_{j=1}^{\infty} A_j \mathbf{X}_{t-j} = \mathbf{Z}_t,$$

in which the matrices $A_j$ have absolutely summable components.
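The covariance formula (8.2.13) is easy to evaluate when only finitely many $C_j$ are nonzero. As a check, the sketch below applies it to the process of Example 8.2.1. One device is an assumption of this illustration rather than part of the text: the univariate noise $Z_t$ is embedded as the first component of a bivariate white noise with identity covariance matrix, whose second component is never used, so that the coefficient matrices are $2 \times 2$. The helper names are hypothetical.

```python
def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def linear_process_gamma(coeffs, sigma, h):
    """Gamma(h) = sum_j C_{j+h} Sigma C_j' for a finite set of nonzero C_j.

    coeffs maps the lag j to the matrix C_j; lags not present are zero.
    """
    m = len(sigma)
    total = [[0.0] * m for _ in range(m)]
    for j, c_j in coeffs.items():
        c_jh = coeffs.get(j + h)
        if c_jh is None:
            continue
        term = matmul(matmul(c_jh, sigma), transpose(c_j))
        total = [[total[r][c] + term[r][c] for c in range(m)] for r in range(m)]
    return total

# Example 8.2.1: X_t1 = Z_t, X_t2 = Z_t + 0.75 Z_{t-10}.
# Only the first noise component enters, so the second columns are zero.
coeffs = {0: [[1.0, 0.0], [1.0, 0.0]],
          10: [[0.0, 0.0], [0.75, 0.0]]}
identity = [[1.0, 0.0], [0.0, 1.0]]

gamma_0 = linear_process_gamma(coeffs, identity, 0)
gamma_10 = linear_process_gamma(coeffs, identity, 10)
gamma_minus_10 = linear_process_gamma(coeffs, identity, -10)
```

The three matrices agree with $\Gamma(0)$, $\Gamma(10)$ and $\Gamma(-10)$ as stated in Example 8.2.1, and every other lag returns the zero matrix.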


8.2.1  Second-Order  Properties  in  the  Frequency  Domain 


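Provided that the components of the covariance matrix function $\Gamma(\cdot)$ have the property $\sum_{h=-\infty}^{\infty} |\gamma_{ij}(h)| < \infty$, $i, j = 1, \ldots, m$, then $\Gamma$ has a matrix-valued spectral density function

$$f(\lambda) = \frac{1}{2\pi} \sum_{h=-\infty}^{\infty} e^{-i\lambda h}\, \Gamma(h), \qquad -\pi \le \lambda \le \pi,$$

and $\Gamma$ can be expressed in terms of $f$ as

$$\Gamma(h) = \int_{-\pi}^{\pi} e^{i\lambda h} f(\lambda)\, d\lambda.$$

The second-order properties of the stationary process $\{\mathbf{X}_t\}$ can therefore be described equivalently in terms of $f(\cdot)$ rather than $\Gamma(\cdot)$. Similarly, $\{\mathbf{X}_t\}$ has a spectral representation

$$\mathbf{X}_t = \int_{-\pi}^{\pi} e^{i\lambda t}\, d\mathbf{Z}(\lambda),$$

where $\{\mathbf{Z}(\lambda), -\pi \le \lambda \le \pi\}$ is a process whose components are complex-valued processes satisfying

$$E\left(dZ_j(\lambda)\, d\overline{Z}_k(\mu)\right) = \begin{cases} f_{jk}(\lambda)\, d\lambda & \text{if } \lambda = \mu, \\ 0 & \text{if } \lambda \ne \mu, \end{cases}$$

and $\overline{Z}_k$ denotes the complex conjugate of $Z_k$. We shall not go into the spectral representation in this book. For details see Brockwell and Davis (1991).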
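As a small illustration of the matrix-valued spectral density, the sketch below evaluates $f(\lambda) = (2\pi)^{-1}\sum_h e^{-i\lambda h}\Gamma(h)$ for the process of Example 8.2.1, whose only nonzero covariance matrices are $\Gamma(-10)$, $\Gamma(0)$ and $\Gamma(10)$. The function name `spectral_density` and the dictionary `GAMMA` are hypothetical names introduced for the example.

```python
import cmath
from math import pi

# Nonzero covariance matrices of Example 8.2.1, indexed by lag h.
GAMMA = {
    -10: [[0.0, 0.75], [0.0, 0.75]],
    0: [[1.0, 1.0], [1.0, 1.5625]],
    10: [[0.0, 0.0], [0.75, 0.75]],
}

def spectral_density(gamma, lam):
    """Evaluate f(lam) = (1 / 2 pi) * sum_h exp(-i lam h) Gamma(h)
    for a covariance function with finitely many nonzero lags."""
    m = len(next(iter(gamma.values())))
    f = [[0j] * m for _ in range(m)]
    for h, g in gamma.items():
        weight = cmath.exp(-1j * lam * h)
        for i in range(m):
            for j in range(m):
                f[i][j] += weight * g[i][j]
    return [[value / (2 * pi) for value in row] for row in f]

f_at_zero = spectral_density(GAMMA, 0.0)
f_at_point = spectral_density(GAMMA, 0.3)
```

Because $\Gamma(h) = \Gamma'(-h)$, the matrix $f(\lambda)$ is Hermitian: its diagonal entries are real, and the $(1,2)$ entry is the complex conjugate of the $(2,1)$ entry.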


8.3  Estimation  of  the  Mean  and  Covariance  Function 

As in the univariate case, the estimation of the mean vector and covariances of a stationary multivariate time series plays an important role in describing and modeling the dependence structure of the component series. In this section we introduce estimators, for a stationary $m$-variate time series $\{\mathbf{X}_t\}$, of the components $\mu_j$, $\gamma_{ij}(h)$, and $\rho_{ij}(h)$ of $\boldsymbol{\mu}$, $\Gamma(h)$, and $R(h)$, respectively. We also examine the large-sample properties of these estimators.


8.3.1  Estimation of $\boldsymbol{\mu}$

A natural unbiased estimator of the mean vector $\boldsymbol{\mu}$ based on the observations $\mathbf{X}_1, \ldots, \mathbf{X}_n$ is the vector of sample means

$$\bar{\mathbf{X}}_n = \frac{1}{n} \sum_{t=1}^{n} \mathbf{X}_t.$$

The resulting estimate of the mean of the $j$th time series is then the univariate sample mean $(1/n)\sum_{t=1}^{n} X_{tj}$. If each of the univariate autocovariance functions $\gamma_{ii}(\cdot)$, $i = 1, \ldots, m$, satisfies the conditions of Proposition 2.4.1, then the consistency of the estimator $\bar{\mathbf{X}}_n$ can be established by applying the proposition to each of the component time series $\{X_{ti}\}$. This immediately gives the following result.


Proposition 8.3.1.


If $\{\mathbf{X}_t\}$ is a stationary multivariate time series with mean $\boldsymbol{\mu}$ and covariance function $\Gamma(\cdot)$, then as $n \to \infty$,

$$E(\bar{\mathbf{X}}_n - \boldsymbol{\mu})'(\bar{\mathbf{X}}_n - \boldsymbol{\mu}) \to 0 \quad \text{if } \gamma_{ii}(n) \to 0, \quad 1 \le i \le m,$$

and

$$nE(\bar{\mathbf{X}}_n - \boldsymbol{\mu})'(\bar{\mathbf{X}}_n - \boldsymbol{\mu}) \to \sum_{i=1}^{m} \sum_{h=-\infty}^{\infty} \gamma_{ii}(h) \quad \text{if } \sum_{h=-\infty}^{\infty} |\gamma_{ii}(h)| < \infty, \quad 1 \le i \le m.$$
Under more restrictive assumptions on the process $\{\mathbf{X}_t\}$ it can also be shown that $\bar{\mathbf{X}}_n$ is approximately normally distributed for large $n$. Determination of the covariance matrix of this distribution would allow us to obtain confidence regions for $\boldsymbol{\mu}$. However, this is quite complicated, and the following simple approximation is useful in practice.

For each $i$ we construct a confidence interval for $\mu_i$ based on the sample mean $\bar{X}_i$ of the univariate series $X_{1i}, \ldots, X_{ni}$ and combine these to form a confidence region for $\boldsymbol{\mu}$. If $f_i(\omega)$ is the spectral density of the $i$th process $\{X_{ti}\}$ and if the sample size $n$ is large, then we know, under the same conditions as in Section 2.4, that $\sqrt{n}(\bar{X}_i - \mu_i)$ is approximately normally distributed with mean zero and variance

$$2\pi f_i(0) = \sum_{k=-\infty}^{\infty} \gamma_{ii}(k).$$

It can also be shown (see, e.g., Anderson 1971) that

$$2\pi \hat{f}_i(0) := \sum_{|h| \le r} \left(1 - \frac{|h|}{r}\right) \hat\gamma_{ii}(h)$$

is a consistent estimator of $2\pi f_i(0)$, provided that $r = r_n$ is a sequence of numbers depending on $n$ in such a way that $r_n \to \infty$ and $r_n/n \to 0$ as $n \to \infty$. Thus if $\bar{X}_i$ denotes the sample mean of the $i$th process and $\Phi_\alpha$ is the $\alpha$-quantile of the standard normal distribution, then the bounds

$$\bar{X}_i \pm \Phi_{1-\alpha/2} \left(2\pi \hat{f}_i(0)/n\right)^{1/2}$$

are asymptotic $(1 - \alpha)$ confidence bounds for $\mu_i$. Hence

$$P\left(|\mu_i - \bar{X}_i| < \Phi_{1-\alpha/2} \left(2\pi \hat{f}_i(0)/n\right)^{1/2},\ i = 1, \ldots, m\right) \ge 1 - \sum_{i=1}^{m} P\left(|\mu_i - \bar{X}_i| \ge \Phi_{1-\alpha/2} \left(2\pi \hat{f}_i(0)/n\right)^{1/2}\right),$$

where the right-hand side converges to $1 - m\alpha$ as $n \to \infty$. Consequently, as $n \to \infty$, the set of $m$-dimensional vectors bounded by

$$\left\{ x_i = \bar{X}_i \pm \Phi_{1-(\alpha/(2m))} \left(2\pi \hat{f}_i(0)/n\right)^{1/2},\ i = 1, \ldots, m \right\} \tag{8.3.1}$$

has a confidence coefficient that converges to a value greater than or equal to $1 - \alpha$ (and substantially greater if $m$ is large). Nevertheless, the region defined by (8.3.1) is easy to determine and is of reasonable size, provided that $m$ is not too large.


8.3.2  Estimation of $\Gamma(h)$


As in the univariate case, a natural estimator of the covariance $\Gamma(h) = E[(\mathbf{X}_{t+h} - \boldsymbol{\mu})(\mathbf{X}_t - \boldsymbol{\mu})']$ is

$$\hat\Gamma(h) = \begin{cases} n^{-1} \displaystyle\sum_{t=1}^{n-h} (\mathbf{X}_{t+h} - \bar{\mathbf{X}}_n)(\mathbf{X}_t - \bar{\mathbf{X}}_n)' & \text{for } 0 \le h \le n-1, \\ \hat\Gamma'(-h) & \text{for } -n+1 \le h < 0. \end{cases}$$

Writing $\hat\gamma_{ij}(h)$ for the $(i, j)$-component of $\hat\Gamma(h)$, $i, j = 1, 2, \ldots$, we estimate the cross-correlations by

$$\hat\rho_{ij}(h) = \hat\gamma_{ij}(h)\,(\hat\gamma_{ii}(0)\hat\gamma_{jj}(0))^{-1/2}.$$

If $i = j$, then $\hat\rho_{ij}$ reduces to the sample autocorrelation function of the $i$th series.

Derivation of the large-sample properties of $\hat\gamma_{ij}$ and $\hat\rho_{ij}$ is quite complicated in general. Here we shall simply note one result that is of particular importance for testing the independence of two component series. For details of the proof of this and related results, see Brockwell and Davis (1991).


Theorem 8.3.1.


Let $\{\mathbf{X}_t\}$ be the bivariate time series whose components are defined by

$$X_{t1} = \sum_{k=-\infty}^{\infty} \alpha_k Z_{t-k,1}, \qquad \{Z_{t1}\} \sim \mathrm{IID}(0, \sigma_1^2),$$

and

$$X_{t2} = \sum_{k=-\infty}^{\infty} \beta_k Z_{t-k,2}, \qquad \{Z_{t2}\} \sim \mathrm{IID}(0, \sigma_2^2),$$

where the two sequences $\{Z_{t1}\}$ and $\{Z_{t2}\}$ are independent, $\sum_k |\alpha_k| < \infty$, and $\sum_k |\beta_k| < \infty$.

Then for all integers $h$ and $k$ with $h \ne k$, the random variables $n^{1/2}\hat\rho_{12}(h)$ and $n^{1/2}\hat\rho_{12}(k)$ are approximately bivariate normal with mean $\mathbf{0}$, variance $\sum_{j=-\infty}^{\infty} \rho_{11}(j)\rho_{22}(j)$, and covariance $\sum_{j=-\infty}^{\infty} \rho_{11}(j)\rho_{22}(j + k - h)$, for $n$ large.

[For a related result that does not require the independence of the two series $\{X_{t1}\}$ and $\{X_{t2}\}$ see Bartlett's Formula, Section 8.3.4 below.]


Theorem 8.3.1 is useful in testing for correlation between two time series. If one of the two processes in the theorem is white noise, then it follows at once from the theorem that $\hat\rho_{12}(h)$ is approximately normally distributed with mean 0 and variance $1/n$, in which case it is straightforward to test the hypothesis that $\rho_{12}(h) = 0$. However, if neither process is white noise, then a value of $\hat\rho_{12}(h)$ that is large relative to $n^{-1/2}$ does not necessarily indicate that $\rho_{12}(h)$ is different from zero. For example, suppose that $\{X_{t1}\}$ and $\{X_{t2}\}$ are two independent AR(1) processes with $\rho_{11}(h) = \rho_{22}(h) = 0.8^{|h|}$. Then the large-sample variance of $\hat\rho_{12}(h)$ is $n^{-1}\left(1 + 2\sum_{k=1}^{\infty} (0.64)^k\right) = 4.556n^{-1}$. It would therefore not be surprising to observe a value of $\hat\rho_{12}(h)$ as large as $3n^{-1/2}$ even though $\{X_{t1}\}$ and $\{X_{t2}\}$ are independent. If on the other hand, $\rho_{11}(h) = 0.8^{|h|}$ and $\rho_{22}(h) = (-0.8)^{|h|}$, then the large-sample variance of $\hat\rho_{12}(h)$ is $0.2195n^{-1}$, and an observed value of $3n^{-1/2}$ for $\hat\rho_{12}(h)$ would be very unlikely.
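The two variances quoted above follow from the sum $\sum_j \rho_{11}(j)\rho_{22}(j)$ in Theorem 8.3.1 with $\rho_{ii}(j) = \phi_i^{|j|}$. The sketch below evaluates a truncated version of that sum; the function name `cross_corr_variance` and the truncation point are choices made for this illustration.

```python
from math import sqrt

def cross_corr_variance(phi1, phi2, terms=200):
    """Large-sample variance of n^(1/2) * rho_hat_12(h) for two independent
    AR(1) processes: sum over j of phi1^|j| * phi2^|j|, truncated at |j| = terms."""
    return 1.0 + 2.0 * sum((phi1 * phi2) ** j for j in range(1, terms + 1))

same_sign = cross_corr_variance(0.8, 0.8)       # approximately 4.556
opposite_sign = cross_corr_variance(0.8, -0.8)  # approximately 0.2195

# An observed value of 3 n^(-1/2), expressed in standard deviations.
z_same = 3.0 / sqrt(same_sign)          # roughly 1.4
z_opposite = 3.0 / sqrt(opposite_sign)  # roughly 6.4
```

The geometric series has the closed form $(1 + \phi_1\phi_2)/(1 - \phi_1\phi_2)$, which gives $1.64/0.36$ and $0.36/1.64$ in the two cases. Measured in standard deviations, the same observed value $3n^{-1/2}$ is unremarkable in the first case and extreme in the second.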


8.3.3  Testing  for  Independence  of  Two  Stationary  Time  Series 

Since by Theorem 8.3.1 the large-sample distribution of $\hat\rho_{12}(h)$ depends on both $\rho_{11}(\cdot)$ and $\rho_{22}(\cdot)$, any test for independence of the two component series cannot be based solely on estimated values of $\rho_{12}(h)$, $h = 0, \pm 1, \ldots$, without taking into account the nature of the two component series.

This difficulty can be circumvented by "prewhitening" the two series before computing the cross-correlations $\hat\rho_{12}(h)$, i.e., by transforming the two series to white noise by application of suitable filters. If $\{X_{t1}\}$ and $\{X_{t2}\}$ are invertible ARMA($p$, $q$) processes, this can be achieved by the transformations

$$Z_{ti} = \sum_{j=0}^{\infty} \pi_j^{(i)} X_{t-j,i},$$

where $\sum_{j=0}^{\infty} \pi_j^{(i)} z^j = \phi^{(i)}(z)/\theta^{(i)}(z)$ and $\phi^{(i)}$, $\theta^{(i)}$ are the autoregressive and moving-average polynomials of the $i$th series, $i = 1, 2$.

Since in practice the true model is nearly always unknown and since the data $X_{tj}$, $t \le 0$, are not available, it is convenient to replace the sequences $\{Z_{ti}\}$ by the residuals $\{\hat{W}_{ti}\}$ after fitting a maximum likelihood ARMA model to each of the component series (see (5.3.1)). If the fitted ARMA models were in fact the true models, the series $\{\hat{W}_{ti}\}$ would be white noise sequences for $i = 1, 2$.

To test the hypothesis $H_0$ that $\{X_{t1}\}$ and $\{X_{t2}\}$ are independent series, we observe that under $H_0$, the corresponding two prewhitened series $\{Z_{t1}\}$ and $\{Z_{t2}\}$ are also independent. Theorem 8.3.1 then implies that the sample cross-correlations $\hat\rho_{12}(h)$, $\hat\rho_{12}(k)$, $h \ne k$, of $\{Z_{t1}\}$ and $\{Z_{t2}\}$ are for large $n$ approximately independent and normally distributed with means 0 and variances $n^{-1}$. An approximate test for independence can therefore be obtained by comparing the values of $|\hat\rho_{12}(h)|$ with $1.96n^{-1/2}$, exactly as in Section 5.3.2. If we prewhiten only one of the two original series, say $\{X_{t1}\}$, then under $H_0$ Theorem 8.3.1 implies that the sample cross-correlations $\hat\rho_{12}(h)$, $\hat\rho_{12}(k)$, $h \ne k$, of $\{Z_{t1}\}$ and $\{X_{t2}\}$ are for large $n$ approximately normal with means 0, variances $n^{-1}$ and covariance $n^{-1}\rho_{22}(k - h)$, where $\rho_{22}(\cdot)$ is the autocorrelation function of $\{X_{t2}\}$. Hence, for any fixed $h$, $\hat\rho_{12}(h)$ also falls (under $H_0$) between the bounds $\pm 1.96n^{-1/2}$ with a probability of approximately 0.95.


The sample correlation functions $\hat\rho_{ij}(\cdot)$, $i,j = 1,2$, of the bivariate time series E731A.TSM (of length $n = 200$) are shown in Figure 8-7. Without taking into account the autocorrelations $\hat\rho_{ii}(\cdot)$, $i = 1, 2$, it is impossible to decide on the basis of the cross-correlations whether or not the two component processes are independent of each other. Notice that many of the sample cross-correlations $\hat\rho_{ij}(h)$, $i \ne j$, lie outside the bounds $\pm 1.96n^{-1/2} = \pm 0.139$. However, these bounds are relevant only if at least one of the component series is white noise. Since this is clearly not the case, a whitening transformation must be applied to at least one of the two component series. Analysis using ITSM leads to AR(1) models for each. The residuals from these maximum likelihood models are stored as a bivariate series in the file E731B.TSM, and their sample correlations, obtained from ITSM, are shown in Figure 8-8. All but two of the cross-correlations are between the bounds $\pm 0.139$, suggesting by Theorem 8.3.1 that the two residual series (and hence the two original series) are uncorrelated. The data for this example were in fact generated as two independent AR(1) series with $\phi = 0.8$ and $\sigma^2 = 1$.

Figure 8-7. The sample correlations of the bivariate series E731A.TSM of Example 8.3.1, showing the bounds $\pm 1.96n^{-1/2}$. (Panels: Series 1; Series 1 x Series 2; Series 2 x Series 1; Series 2.)

□ 
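The effect described in this example can be reproduced by simulation. The following is a minimal sketch, not the ITSM computation: it assumes Gaussian noise, uses a least-squares AR(1) fit in place of maximum likelihood, and all function names are hypothetical. It estimates how often the sample cross-correlations of two independent AR(1) series with $\phi = 0.8$ fall outside $\pm 1.96n^{-1/2}$, before and after whitening. For independent series the large-sample variance of $\hat\rho_{12}(h)$ is $n^{-1}\sum_j \rho_{11}(j)\rho_{22}(j)$, which for two AR(1) series with the same $\phi$ equals $n^{-1}(1+\phi^2)/(1-\phi^2)$.

```python
import numpy as np

def simulate_ar1(phi, n, rng, burn=200):
    """Simulate n values of X_t = phi * X_{t-1} + Z_t with N(0, 1) noise."""
    z = rng.standard_normal(n + burn)
    x = np.zeros(n + burn)
    for t in range(1, n + burn):
        x[t] = phi * x[t - 1] + z[t]
    return x[burn:]  # drop the burn-in so the start-up effect is negligible

def cross_corr(x, y, h):
    """Sample cross-correlation between x_{t+h} and y_t (h may be negative)."""
    n = len(x)
    x = x - x.mean()
    y = y - y.mean()
    if h >= 0:
        c = np.sum(x[h:] * y[:n - h]) / n
    else:
        c = np.sum(x[:n + h] * y[-h:]) / n
    return c / np.sqrt(np.mean(x ** 2) * np.mean(y ** 2))

def ar1_residuals(x):
    """Whiten x with a least-squares AR(1) fit (a stand-in for an ML fit)."""
    x = x - x.mean()
    phi_hat = np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)
    return x[1:] - phi_hat * x[:-1]

rng = np.random.default_rng(0)
n, phi, reps = 200, 0.8, 200
lags = range(-20, 21)
bound = 1.96 / np.sqrt(n)

raw_out = white_out = total = 0
for _ in range(reps):
    x1, x2 = simulate_ar1(phi, n, rng), simulate_ar1(phi, n, rng)
    w1, w2 = ar1_residuals(x1), ar1_residuals(x2)
    for h in lags:
        raw_out += abs(cross_corr(x1, x2, h)) > bound
        white_out += abs(cross_corr(w1, w2, h)) > bound
        total += 1

raw_rate, white_rate = raw_out / total, white_out / total
# Large-sample value of n * Var(rho_hat_12(h)) for two independent AR(1) series
variance_factor = (1 + phi ** 2) / (1 - phi ** 2)
print(raw_rate, white_rate, variance_factor)
```

The unwhitened series should exceed the bounds far more often than 5% of the time, while the whitened residuals should exceed them at roughly the nominal rate.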


8.3.4  Bartlett's  Formula 

In Section 2.4 we gave Bartlett's formula for the large-sample distribution of the sample autocorrelation vector $\hat{\boldsymbol\rho} = (\hat\rho(1), \ldots, \hat\rho(k))'$ of a univariate time series. The following theorem gives a large-sample approximation to the covariances of the sample cross-correlations $\hat\rho_{12}(h)$ and $\hat\rho_{12}(k)$ of the bivariate time series $\{\mathbf{X}_t\}$ under the assumption that $\{\mathbf{X}_t\}$ is Gaussian. However, it is not assumed (as in Theorem 8.3.1) that $\{X_{t1}\}$ is independent of $\{X_{t2}\}$.
Figure 8-8. The sample correlations of the bivariate series of residuals E731B.TSM, whose components are the residuals from the AR(1) models fitted to each of the component series in E731A.TSM.

Figure 8-9. The sample correlations of the whitened series $\hat W_{t+h,1}$ and $\hat W_{t2}$ of Example 8.3.2, showing the bounds $\pm 1.96n^{-1/2}$.
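Bartlett's Formula:

If $\{\mathbf{X}_t\}$ is a bivariate Gaussian time series with covariances satisfying $\sum_{h=-\infty}^{\infty} |\gamma_{ij}(h)| < \infty$, $i,j = 1,2$, then

$$
\begin{aligned}
\lim_{n\to\infty} n\,\mathrm{Cov}\big(\hat\rho_{12}(h), \hat\rho_{12}(k)\big) = \sum_{j=-\infty}^{\infty} \Big[ &\rho_{11}(j)\rho_{22}(j+k-h) + \rho_{12}(j+k)\rho_{21}(j-h) \\
&- \rho_{12}(h)\{\rho_{11}(j)\rho_{12}(j+k) + \rho_{22}(j)\rho_{21}(j-k)\} \\
&- \rho_{12}(k)\{\rho_{11}(j)\rho_{12}(j+h) + \rho_{22}(j)\rho_{21}(j-h)\} \\
&+ \rho_{12}(h)\rho_{12}(k)\left\{\tfrac{1}{2}\rho_{11}^2(j) + \rho_{12}^2(j) + \tfrac{1}{2}\rho_{22}^2(j)\right\} \Big].
\end{aligned}
$$

Corollary 8.3.1. If $\{\mathbf{X}_t\}$ satisfies the conditions for Bartlett's formula, if either $\{X_{t1}\}$ or $\{X_{t2}\}$ is white noise, and if

$$\rho_{12}(h) = 0, \quad h \notin [a, b],$$

then

$$\lim_{n\to\infty} n\,\mathrm{Var}\big(\hat\rho_{12}(h)\big) = 1, \quad h \notin [a, b].$$

Example 8.3.2. Sales with a leading indicator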

We consider again the differenced series $\{D_{t1}\}$ and $\{D_{t2}\}$ of Example 8.1.2, for which we found the maximum likelihood models (8.1.1) and (8.1.2) using ITSM. The residuals from the two models (which can be filed by ITSM) are the two "whitened" series $\{\hat W_{t1}\}$ and $\{\hat W_{t2}\}$ with sample variances 0.0779 and 1.754, respectively. This bivariate series is contained in the file E732.TSM.

The sample auto- and cross-correlations of $\{D_{t1}\}$ and $\{D_{t2}\}$ were shown in Figure 8-6. Without taking into account the autocorrelations, it is not possible to draw any conclusions about the dependence between the two component series from the cross-correlations.

Examination of the sample cross-correlation function of the whitened series $\{\hat W_{t1}\}$ and $\{\hat W_{t2}\}$, on the other hand, is much more informative. From Figure 8-9 it is apparent that there is one large sample cross-correlation (between $\hat W_{t+3,2}$ and $\hat W_{t,1}$), while the others are all between $\pm 1.96n^{-1/2}$.

□ 

If $\{\hat W_{t1}\}$ and $\{\hat W_{t2}\}$ are assumed to be jointly Gaussian, Corollary 8.3.1 indicates the compatibility of the cross-correlations with a model for which

$$\rho_{12}(-3) \ne 0$$

and

$$\rho_{12}(h) = 0, \quad h \ne -3.$$

The value $\hat\rho_{12}(-3) = 0.969$ suggests the model

$$\hat W_{t2} = 4.74\,\hat W_{t-3,1} + N_t, \tag{8.3.2}$$

where the stationary noise $\{N_t\}$ has small variance compared with $\{\hat W_{t2}\}$ and $\{\hat W_{t1}\}$, and the coefficient 4.74 is the square root of the ratio of sample variances of $\{\hat W_{t2}\}$ and $\{\hat W_{t1}\}$. A study of the sample values of $\{\hat W_{t2} - 4.74\,\hat W_{t-3,1}\}$ suggests the model

$$(1 + 0.345B)N_t = U_t, \quad \{U_t\} \sim \mathrm{WN}(0, 0.0782) \tag{8.3.3}$$

for $\{N_t\}$. Finally, replacing $\hat W_{t2}$ and $\hat W_{t-3,1}$ in (8.3.2) by $Z_{t2}$ and $Z_{t-3,1}$, respectively, and then using (8.1.1) and (8.1.2) to express $Z_{t2}$ and $Z_{t-3,1}$ in terms of $\{D_{t2}\}$ and $\{D_{t1}\}$, we obtain a model relating $\{D_{t1}\}$, $\{D_{t2}\}$, and $\{U_t\}$, namely,

$$D_{t2} + 0.0773 = (1 - 0.610B)(1 - 0.838B)^{-1}\left[4.74(1 - 0.474B)^{-1}D_{t-3,1} + (1 + 0.345B)^{-1}U_t\right].$$

This model should be compared with the one derived later in Section 11.1 by the more systematic technique of transfer function modeling.


8.4  Multivariate  ARMA  Processes 

As in the univariate case, we can define an extremely useful class of multivariate stationary processes $\{\mathbf{X}_t\}$ by requiring that $\{\mathbf{X}_t\}$ should satisfy a set of linear difference equations with constant coefficients. Multivariate white noise $\{\mathbf{Z}_t\}$ (see Definition 8.2.2) is a fundamental building block from which these ARMA processes are constructed.


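Definition 8.4.1.

$\{\mathbf{X}_t\}$ is an ARMA($p$, $q$) process if $\{\mathbf{X}_t\}$ is stationary and if for every $t$,

$$\mathbf{X}_t - \Phi_1\mathbf{X}_{t-1} - \cdots - \Phi_p\mathbf{X}_{t-p} = \mathbf{Z}_t + \Theta_1\mathbf{Z}_{t-1} + \cdots + \Theta_q\mathbf{Z}_{t-q}, \tag{8.4.1}$$

where $\{\mathbf{Z}_t\} \sim \mathrm{WN}(\mathbf{0}, \Sigma)$. ($\{\mathbf{X}_t\}$ is an ARMA($p$, $q$) process with mean $\boldsymbol\mu$ if $\{\mathbf{X}_t - \boldsymbol\mu\}$ is an ARMA($p$, $q$) process.)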


Equations (8.4.1) can be written in the more compact form

$$\Phi(B)\mathbf{X}_t = \Theta(B)\mathbf{Z}_t, \quad \{\mathbf{Z}_t\} \sim \mathrm{WN}(\mathbf{0}, \Sigma), \tag{8.4.2}$$

where $\Phi(z) := I - \Phi_1 z - \cdots - \Phi_p z^p$ and $\Theta(z) := I + \Theta_1 z + \cdots + \Theta_q z^q$ are matrix-valued polynomials, $I$ is the $m \times m$ identity matrix, and $B$ as usual denotes the backward shift operator. (Each component of the matrices $\Phi(z)$, $\Theta(z)$ is a polynomial with real coefficients and degree less than or equal to $p$, $q$, respectively.)


Example 8.4.1. The multivariate AR(1) process

Setting $p = 1$ and $q = 0$ in (8.4.1) gives the defining equations

$$\mathbf{X}_t = \Phi\mathbf{X}_{t-1} + \mathbf{Z}_t, \quad \{\mathbf{Z}_t\} \sim \mathrm{WN}(\mathbf{0}, \Sigma), \tag{8.4.3}$$

for the multivariate AR(1) series $\{\mathbf{X}_t\}$. By exactly the same argument as used in Example 2.2.1, we can express $\mathbf{X}_t$ as

$$\mathbf{X}_t = \sum_{j=0}^{\infty} \Phi^j\mathbf{Z}_{t-j}, \tag{8.4.4}$$

provided that all the eigenvalues of $\Phi$ are less than 1 in absolute value, i.e., provided that

$$\det(I - z\Phi) \ne 0 \quad \text{for all } z \in \mathbb{C} \text{ such that } |z| \le 1. \tag{8.4.5}$$

If this condition is satisfied, then the coefficients $\Phi^j$ are absolutely summable, and hence the series in (8.4.4) converges; i.e., each component of the partial sums $\sum_{j=0}^{n} \Phi^j\mathbf{Z}_{t-j}$ converges (see Remark 1 of Section 2.2). The same argument as in Example 2.2.1 also shows that (8.4.4) is the unique stationary solution of (8.4.3). The condition that all the eigenvalues of $\Phi$ should be less than 1 in absolute value (or equivalently (8.4.5)) is just the multivariate analogue of the condition $|\phi| < 1$ required for the existence of a causal stationary solution of the univariate AR(1) equations (2.2.8).

□ 
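The eigenvalue condition of this example is easy to check numerically. The sketch below uses hypothetical coefficient matrices chosen only for illustration; it tests the condition with `numpy.linalg.eigvals` and confirms that, when it holds, the partial sums of $\Phi^j$ settle down to $(I - \Phi)^{-1}$, as absolute summability requires.

```python
import numpy as np

def is_causal_var1(Phi):
    """True if every eigenvalue of Phi lies strictly inside the unit circle."""
    return bool(np.max(np.abs(np.linalg.eigvals(Phi))) < 1.0)

# Hypothetical coefficient matrices, chosen only for illustration
Phi_good = np.array([[0.5, 0.3],
                     [0.2, 0.4]])   # eigenvalues 0.7 and 0.2
Phi_bad = np.array([[1.1, 0.0],
                    [0.0, 0.5]])    # one eigenvalue outside the unit circle

# Partial sum of Phi^j for j = 0, ..., 199
partial_sum = np.zeros((2, 2))
power = np.eye(2)
for _ in range(200):
    partial_sum += power
    power = Phi_good @ power

limit = np.linalg.inv(np.eye(2) - Phi_good)
print(is_causal_var1(Phi_good), is_causal_var1(Phi_bad))
print(np.allclose(partial_sum, limit))
```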

Causality and invertibility of a multivariate ARMA($p$, $q$) process are defined precisely as in Section 3.1, except that the coefficients $\psi_j$, $\pi_j$ in the representations $X_t = \sum_{j=0}^{\infty} \psi_j Z_{t-j}$ and $Z_t = \sum_{j=0}^{\infty} \pi_j X_{t-j}$ are replaced by $m \times m$ matrices $\Psi_j$ and $\Pi_j$ whose components are required to be absolutely summable. The following two theorems (proofs of which can be found in Brockwell and Davis (1991)) provide us with criteria for causality and invertibility analogous to those of Section 3.1.


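Causality:

An ARMA($p$, $q$) process $\{\mathbf{X}_t\}$ is causal, or a causal function of $\{\mathbf{Z}_t\}$, if there exist matrices $\{\Psi_j\}$ with absolutely summable components such that

$$\mathbf{X}_t = \sum_{j=0}^{\infty} \Psi_j\mathbf{Z}_{t-j} \quad \text{for all } t. \tag{8.4.6}$$

Causality is equivalent to the condition

$$\det\Phi(z) \ne 0 \quad \text{for all } z \in \mathbb{C} \text{ such that } |z| \le 1. \tag{8.4.7}$$

The matrices $\Psi_j$ are found recursively from the equations

$$\Psi_j = \Theta_j + \sum_{k=1}^{\infty} \Phi_k\Psi_{j-k}, \quad j = 0, 1, \ldots, \tag{8.4.8}$$

where we define $\Theta_0 = I$, $\Theta_j = 0$ for $j > q$, $\Phi_j = 0$ for $j > p$, and $\Psi_j = 0$ for $j < 0$.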


Invertibility:

An ARMA($p$, $q$) process $\{\mathbf{X}_t\}$ is invertible if there exist matrices $\{\Pi_j\}$ with absolutely summable components such that

$$\mathbf{Z}_t = \sum_{j=0}^{\infty} \Pi_j\mathbf{X}_{t-j} \quad \text{for all } t. \tag{8.4.9}$$

Invertibility is equivalent to the condition

$$\det\Theta(z) \ne 0 \quad \text{for all } z \in \mathbb{C} \text{ such that } |z| \le 1. \tag{8.4.10}$$

The matrices $\Pi_j$ are found recursively from the equations

$$\Pi_j = -\Phi_j - \sum_{k=1}^{\infty} \Theta_k\Pi_{j-k}, \quad j = 0, 1, \ldots, \tag{8.4.11}$$

where we define $\Phi_0 = -I$, $\Phi_j = 0$ for $j > p$, $\Theta_j = 0$ for $j > q$, and $\Pi_j = 0$ for $j < 0$.
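The recursions (8.4.8) and (8.4.11) translate directly into code. The following is a minimal sketch with hypothetical function names and hypothetical ARMA(1,1) coefficient matrices; the coefficient matrices are passed as Python lists `[Phi_1, ..., Phi_p]` and `[Theta_1, ..., Theta_q]`. As a consistency check it uses the fact that $\Pi(z)\Psi(z) = I$, so the convolution $\sum_{k=0}^{j} \Pi_k\Psi_{j-k}$ must vanish for every $j \ge 1$.

```python
import numpy as np

def _dim(Phi, Theta):
    """Dimension m taken from the first available coefficient matrix."""
    return (Phi[0] if Phi else Theta[0]).shape[0]

def psi_matrices(Phi, Theta, n_terms):
    """Psi_0, ..., Psi_{n_terms-1} from recursion (8.4.8)."""
    m, p, q = _dim(Phi, Theta), len(Phi), len(Theta)
    Psi = []
    for j in range(n_terms):
        # Theta_0 = I and Theta_j = 0 for j > q
        term = np.eye(m) if j == 0 else (Theta[j - 1].copy() if j <= q else np.zeros((m, m)))
        for k in range(1, min(j, p) + 1):   # Phi_k = 0 for k > p, Psi_j = 0 for j < 0
            term = term + Phi[k - 1] @ Psi[j - k]
        Psi.append(term)
    return Psi

def pi_matrices(Phi, Theta, n_terms):
    """Pi_0, ..., Pi_{n_terms-1} from recursion (8.4.11)."""
    m, p, q = _dim(Phi, Theta), len(Phi), len(Theta)
    Pi = []
    for j in range(n_terms):
        # -Phi_0 = I and Phi_j = 0 for j > p
        term = np.eye(m) if j == 0 else (-Phi[j - 1] if j <= p else np.zeros((m, m)))
        for k in range(1, min(j, q) + 1):   # Theta_k = 0 for k > q, Pi_j = 0 for j < 0
            term = term - Theta[k - 1] @ Pi[j - k]
        Pi.append(term)
    return Pi

# Hypothetical ARMA(1,1) coefficients, for illustration only
Phi1 = np.array([[0.5, 0.3], [0.2, 0.4]])
Theta1 = np.array([[0.4, 0.0], [0.1, 0.3]])
Psi = psi_matrices([Phi1], [Theta1], 8)
Pi = pi_matrices([Phi1], [Theta1], 8)
print(Psi[1])   # equals Theta_1 + Phi_1
print(Pi[1])    # equals -Phi_1 - Theta_1
```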
Example 8.4.2. For the multivariate AR(1) process defined by (8.4.3), the recursions (8.4.8) give

$$\Psi_0 = I,$$
$$\Psi_1 = \Phi\Psi_0 = \Phi,$$
$$\Psi_2 = \Phi\Psi_1 = \Phi^2,$$
$$\vdots$$
$$\Psi_j = \Phi\Psi_{j-1} = \Phi^j, \quad j \ge 3,$$

as already found in Example 8.4.1.

□ 


Remark 3. For the bivariate AR(1) process (8.4.3) with

$$\Phi = \begin{bmatrix} 0 & 0.5 \\ 0 & 0 \end{bmatrix},$$

it is easy to check that $\Psi_j = \Phi^j = 0$ for $j > 1$ and hence that $\{\mathbf{X}_t\}$ has the alternative representation

$$\mathbf{X}_t = \mathbf{Z}_t + \Phi\mathbf{Z}_{t-1}$$

as an MA(1) process. This example shows that it is not always possible to distinguish between multivariate ARMA models of different orders without imposing further restrictions. If, for example, attention is restricted to pure AR processes, the problem does not arise. For detailed accounts of the identification problem for general ARMA($p$, $q$) models see Hannan and Deistler (1988) and Lütkepohl (1993). □
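The claim in Remark 3 can be verified in a few lines. This sketch assumes Gaussian noise and takes $\mathbf{Z}_{-1} = \mathbf{0}$ as a start-up convention (both are assumptions made only for the illustration); it checks that $\Phi^2 = 0$ and that the AR(1) recursion and the MA(1) representation generate identical sample paths.

```python
import numpy as np

Phi = np.array([[0.0, 0.5],
                [0.0, 0.0]])
Phi_squared = Phi @ Phi          # the zero matrix, so Psi_j = 0 for j > 1

rng = np.random.default_rng(1)
Z = rng.standard_normal((50, 2))  # row t holds the noise vector Z_t

# AR(1) form: X_t = Phi X_{t-1} + Z_t, started at X_0 = Z_0
X_ar = np.zeros_like(Z)
X_ar[0] = Z[0]
for t in range(1, len(Z)):
    X_ar[t] = Phi @ X_ar[t - 1] + Z[t]

# MA(1) form: X_t = Z_t + Phi Z_{t-1}
X_ma = Z.copy()
X_ma[1:] += Z[:-1] @ Phi.T

print(np.allclose(Phi_squared, 0), np.allclose(X_ar, X_ma))
```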


8.4.1  The  Covariance  Matrix  Function  of  a  Causal  ARMA  Process 

From (8.2.13) we can express the covariance matrix $\Gamma(h) = E(\mathbf{X}_{t+h}\mathbf{X}_t')$ of the causal process (8.4.6) as

$$\Gamma(h) = \sum_{j=0}^{\infty} \Psi_{h+j}\Sigma\Psi_j', \quad h = 0, \pm 1, \ldots, \tag{8.4.12}$$

where the matrices $\Psi_j$ are found from (8.4.8) and $\Psi_j := 0$ for $j < 0$.

The covariance matrices $\Gamma(h)$, $h = 0, \pm 1, \ldots$, can also be found by solving the Yule-Walker equations

$$\Gamma(j) - \sum_{r=1}^{p} \Phi_r\Gamma(j-r) = \sum_{j \le r \le q} \Theta_r\Sigma\Psi_{r-j}', \quad j = 0, 1, 2, \ldots, \tag{8.4.13}$$

obtained by postmultiplying (8.4.1) by $\mathbf{X}_{t-j}'$ and taking expectations. The first $p + 1$ of the equations (8.4.13) can be solved for the components of $\Gamma(0), \ldots, \Gamma(p)$ using the fact that $\Gamma(-h) = \Gamma'(h)$. The remaining equations then give $\Gamma(p+1), \Gamma(p+2), \ldots$ recursively. An explicit form of the solution of these equations can be written down by making use of Kronecker products and the vec operator (see e.g., Lütkepohl 1993).
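For a causal VAR(1) process the two routes to $\Gamma(h)$ can be compared directly. The sketch below uses hypothetical values of $\Phi$ and $\Sigma$ and hypothetical function names. It evaluates the truncated series (8.4.12) with $\Psi_j = \Phi^j$, and separately solves the Yule-Walker equations for $j = 0, 1$, which for $p = 1$, $q = 0$ reduce to $\Gamma(0) = \Phi\Gamma(0)\Phi' + \Sigma$ and $\Gamma(1) = \Phi\Gamma(0)$. The first of these is solved with the identity $\mathrm{vec}(AXB) = (B' \otimes A)\,\mathrm{vec}(X)$.

```python
import numpy as np

# Hypothetical VAR(1) parameters, for illustration only
Phi = np.array([[0.5, 0.3], [0.2, 0.4]])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])

def gamma_series(Phi, Sigma, h, n_terms=200):
    """Truncated series (8.4.12) for a VAR(1), where Psi_j = Phi^j (h >= 0)."""
    total = np.zeros_like(Sigma)
    psi_j = np.eye(Phi.shape[0])
    psi_hj = np.linalg.matrix_power(Phi, h)
    for _ in range(n_terms):
        total += psi_hj @ Sigma @ psi_j.T
        psi_j, psi_hj = Phi @ psi_j, Phi @ psi_hj
    return total

def gamma0_kron(Phi, Sigma):
    """Solve Gamma(0) = Phi Gamma(0) Phi' + Sigma via vec and Kronecker products."""
    m = Phi.shape[0]
    lhs = np.eye(m * m) - np.kron(Phi, Phi)
    vec_gamma0 = np.linalg.solve(lhs, Sigma.reshape(-1, order="F"))
    return vec_gamma0.reshape(m, m, order="F")

G0_series = gamma_series(Phi, Sigma, 0)
G0_kron = gamma0_kron(Phi, Sigma)
G1_series = gamma_series(Phi, Sigma, 1)
print(np.allclose(G0_series, G0_kron), np.allclose(G1_series, Phi @ G0_kron))
```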

Remark 4. If $z_0$ is the root of $\det\Phi(z) = 0$ with smallest absolute value, then it can be shown from the recursions (8.4.8) that $\Psi_j/r^j \to 0$ as $j \to \infty$ for all $r$ such that $|z_0|^{-1} < r < 1$. Hence, there is a constant $C$ such that each component of $\Psi_j$ is smaller in absolute value than $Cr^j$. This implies in turn that there is a constant $K$ such that each component of the matrix $\Psi_{h+j}\Sigma\Psi_j'$ on the right of (8.4.12) is bounded in absolute value by $Kr^{2j}$. Provided that $|z_0|$ is not very close to 1, this means that the series (8.4.12) converges rapidly, and the error incurred in each component by truncating the series after the term with $j = k - 1$ is smaller in absolute value than $\sum_{j=k}^{\infty} Kr^{2j} = Kr^{2k}/(1 - r^2)$.


8.5  Best  Linear  Predictors  of  Second-Order  Random  Vectors 

Let $\{\mathbf{X}_t = (X_{t1}, \ldots, X_{tm})'\}$ be an $m$-variate time series with means $E\mathbf{X}_t = \boldsymbol\mu_t$ and covariance function given by the $m \times m$ matrices

$$K(i, j) = E(\mathbf{X}_i\mathbf{X}_j') - \boldsymbol\mu_i\boldsymbol\mu_j'.$$

If $\mathbf{Y} = (Y_1, \ldots, Y_m)'$ is a random vector with finite second moments and $E\mathbf{Y} = \boldsymbol\mu$, we define

$$P_n(\mathbf{Y}) = (P_nY_1, \ldots, P_nY_m)', \tag{8.5.1}$$

where $P_nY_j$ is the best linear predictor of the component $Y_j$ of $\mathbf{Y}$ in terms of all of the components of the vectors $\mathbf{X}_t$, $t = 1, \ldots, n$, and the constant 1. It follows immediately from the properties of the prediction operator (Section 2.5) that

$$P_n(\mathbf{Y}) = \boldsymbol\mu + A_1(\mathbf{X}_n - \boldsymbol\mu_n) + \cdots + A_n(\mathbf{X}_1 - \boldsymbol\mu_1) \tag{8.5.2}$$

for some matrices $A_1, \ldots, A_n$ and that

$$\mathbf{Y} - P_n(\mathbf{Y}) \perp \mathbf{X}_{n+1-i}, \quad i = 1, \ldots, n, \tag{8.5.3}$$

where we say that two $m$-dimensional random vectors $\mathbf{X}$ and $\mathbf{Y}$ are orthogonal (written $\mathbf{X} \perp \mathbf{Y}$) if $E(\mathbf{X}\mathbf{Y}')$ is a matrix of zeros. The vector of best predictors (8.5.1) is uniquely determined by (8.5.2) and (8.5.3), although it is possible that there may be more than one possible choice for $A_1, \ldots, A_n$.

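As a special case of the above, if $\{\mathbf{X}_t\}$ is a zero-mean time series, the best linear predictor $\hat{\mathbf{X}}_{n+1}$ of $\mathbf{X}_{n+1}$ in terms of $\mathbf{X}_1, \ldots, \mathbf{X}_n$ is obtained on replacing $\mathbf{Y}$ by $\mathbf{X}_{n+1}$ in (8.5.1). Thus

$$\hat{\mathbf{X}}_{n+1} = \begin{cases} \mathbf{0}, & \text{if } n = 0, \\ P_n(\mathbf{X}_{n+1}), & \text{if } n \ge 1. \end{cases}$$

Hence, we can write

$$\hat{\mathbf{X}}_{n+1} = \Phi_{n1}\mathbf{X}_n + \cdots + \Phi_{nn}\mathbf{X}_1, \quad n = 1, 2, \ldots, \tag{8.5.4}$$

where, from (8.5.3), the coefficients $\Phi_{nj}$, $j = 1, \ldots, n$, are such that

$$E\big(\hat{\mathbf{X}}_{n+1}\mathbf{X}_{n+1-i}'\big) = E\big(\mathbf{X}_{n+1}\mathbf{X}_{n+1-i}'\big), \quad i = 1, \ldots, n, \tag{8.5.5}$$

i.e.,

$$\sum_{j=1}^{n} \Phi_{nj}K(n+1-j, n+1-i) = K(n+1, n+1-i), \quad i = 1, \ldots, n.$$

In the case where $\{\mathbf{X}_t\}$ is stationary with $K(i, j) = \Gamma(i - j)$, the prediction equations simplify to the $m$-dimensional analogues of (2.5.7), i.e.,

$$\sum_{j=1}^{n} \Phi_{nj}\Gamma(i - j) = \Gamma(i), \quad i = 1, \ldots, n. \tag{8.5.6}$$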
Provided that the covariance matrix of the $nm$ components of $\mathbf{X}_1, \ldots, \mathbf{X}_n$ is nonsingular for every $n \ge 1$, the coefficients $\{\Phi_{nj}\}$ can be determined recursively using a multivariate version of the Durbin-Levinson algorithm given by Whittle (1963) (for details see Brockwell and Davis (1991), Proposition 11.4.1). Whittle's recursions also determine the covariance matrices of the one-step prediction errors, namely, $V_0 = \Gamma(0)$ and, for $n \ge 1$,

$$V_n = E(\mathbf{X}_{n+1} - \hat{\mathbf{X}}_{n+1})(\mathbf{X}_{n+1} - \hat{\mathbf{X}}_{n+1})' = \Gamma(0) - \Phi_{n1}\Gamma(-1) - \cdots - \Phi_{nn}\Gamma(-n). \tag{8.5.7}$$

Remark  5.  The  innovations  algorithm  also  has  a  multivariate  version  that  can  be  used 
for  prediction  in  much  the  same  way  as  the  univariate  version  described  in  Section  2.5.4 
(for  details  see  Brockwell  and  Davis  (1991),  Proposition  11.4.2).  □ 
8.6 Modeling and Forecasting with Multivariate AR Processes

If $\{\mathbf{X}_t\}$ is any zero-mean second-order multivariate time series, it is easy to show from the results of Section 8.5 (Problem 8.4) that the one-step prediction errors $\mathbf{X}_j - \hat{\mathbf{X}}_j$, $j = 1, \ldots, n$, have the property

$$E\big(\mathbf{X}_j - \hat{\mathbf{X}}_j\big)\big(\mathbf{X}_k - \hat{\mathbf{X}}_k\big)' = 0 \quad \text{for } j \ne k. \tag{8.6.1}$$

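Moreover, the matrix $M$ such that

$$\begin{bmatrix} \mathbf{X}_1 - \hat{\mathbf{X}}_1 \\ \mathbf{X}_2 - \hat{\mathbf{X}}_2 \\ \mathbf{X}_3 - \hat{\mathbf{X}}_3 \\ \vdots \\ \mathbf{X}_n - \hat{\mathbf{X}}_n \end{bmatrix} = M \begin{bmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \\ \mathbf{X}_3 \\ \vdots \\ \mathbf{X}_n \end{bmatrix} \tag{8.6.2}$$

is lower triangular with ones on the diagonal and therefore has determinant equal to 1.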
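If the series $\{\mathbf{X}_t\}$ is also Gaussian, then (8.6.1) implies that the prediction errors $\mathbf{U}_j = \mathbf{X}_j - \hat{\mathbf{X}}_j$, $j = 1, \ldots, n$, are independent with covariance matrices $V_0, \ldots, V_{n-1}$, respectively (as specified in (8.5.7)). Consequently, the joint density of the prediction errors is the product

$$f(\mathbf{u}_1, \ldots, \mathbf{u}_n) = (2\pi)^{-nm/2}\left(\prod_{j=1}^{n} \det V_{j-1}\right)^{-1/2} \exp\left[-\frac{1}{2}\sum_{j=1}^{n} \mathbf{u}_j'V_{j-1}^{-1}\mathbf{u}_j\right].$$

Since the determinant of the matrix $M$ in (8.6.2) is equal to 1, the joint density of the observations $\mathbf{X}_1, \ldots, \mathbf{X}_n$ at $\mathbf{x}_1, \ldots, \mathbf{x}_n$ is obtained on replacing $\mathbf{u}_1, \ldots, \mathbf{u}_n$ in the last expression by the values of $\mathbf{X}_j - \hat{\mathbf{X}}_j$ corresponding to the observations $\mathbf{x}_1, \ldots, \mathbf{x}_n$.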

If we suppose that $\{\mathbf{X}_t\}$ is a zero-mean $m$-variate AR($p$) process with coefficient matrices $\Phi = \{\Phi_1, \ldots, \Phi_p\}$ and white noise covariance matrix $\Sigma$, we can therefore express the likelihood of the observations $\mathbf{X}_1, \ldots, \mathbf{X}_n$ as

$$L(\Phi, \Sigma) = (2\pi)^{-nm/2}\left(\prod_{j=1}^{n} \det V_{j-1}\right)^{-1/2} \exp\left[-\frac{1}{2}\sum_{j=1}^{n} \mathbf{U}_j'V_{j-1}^{-1}\mathbf{U}_j\right],$$

where $\mathbf{U}_j = \mathbf{X}_j - \hat{\mathbf{X}}_j$, $j = 1, \ldots, n$, and $\hat{\mathbf{X}}_j$ and $V_j$ are found from (8.5.4), (8.5.6), and (8.5.7).

Maximization of the Gaussian likelihood is much more difficult in the multivariate than in the univariate case because of the potentially large number of parameters involved and the fact that it is not possible to compute the maximum likelihood estimator of $\Phi$ independently of $\Sigma$ as in the univariate case. In principle, maximum likelihood estimators can be computed with the aid of efficient nonlinear optimization algorithms, but it is important to begin the search with preliminary estimates that are reasonably close to the maximum. For pure AR processes good preliminary estimates can be obtained using Whittle's algorithm or a multivariate version of Burg's algorithm given by Jones (1978). We shall restrict our discussion here to the use of Whittle's algorithm (the multivariate option AR-Model>Estimation>Yule-Walker in ITSM), but Jones's multivariate version of Burg's algorithm is also available (AR-Model>Estimation>Burg). Other useful algorithms can be found in Lütkepohl (1993), in particular the method of conditional least squares and the method of Hannan and Rissanen (1982), the latter being useful also for preliminary estimation in the more difficult problem of fitting ARMA($p$, $q$) models with $q > 0$. Spectral methods of estimation for multivariate ARMA processes are also frequently used. A discussion of these (as well as some time-domain methods) is given in Anderson (1980).

Order selection for multivariate autoregressive models can be made by minimizing a multivariate analogue of the univariate AICC statistic

$$\mathrm{AICC} = -2\ln L(\Phi_1, \ldots, \Phi_p, \Sigma) + \frac{2(pm^2 + 1)nm}{nm - pm^2 - 2}. \tag{8.6.3}$$
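The penalty term in (8.6.3) is simple to evaluate once $-2\ln L$ is available from a fitting routine. The helper below is a hypothetical illustration (the function name and argument names are not from ITSM); it takes the value of $-2\ln L$ as given and adds the penalty for $n$ observations of an $m$-variate AR($p$) model.

```python
def aicc(neg2_log_lik, n, m, p):
    """AICC of (8.6.3): -2 ln L + 2 (p m^2 + 1) n m / (n m - p m^2 - 2)."""
    n_params = p * m ** 2 + 1
    denom = n * m - p * m ** 2 - 2
    if denom <= 0:
        raise ValueError("too few observations for this model order")
    return neg2_log_lik + 2 * n_params * n * m / denom

# Penalty alone (taking -2 ln L = 0) for n = 250 bivariate observations
penalty_p1 = aicc(0.0, n=250, m=2, p=1)
penalty_p2 = aicc(0.0, n=250, m=2, p=2)
print(penalty_p1, penalty_p2)
```

Because the penalty grows with $p$, a higher-order model is preferred only when it lowers $-2\ln L$ by more than the increase in the penalty.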


8.6.1  Estimation  for  Autoregressive  Processes  Using  Whittle's  Algorithm 

If $\{\mathbf{X}_t\}$ is the (causal) multivariate AR($p$) process defined by the difference equations

$$\mathbf{X}_t = \Phi_1\mathbf{X}_{t-1} + \cdots + \Phi_p\mathbf{X}_{t-p} + \mathbf{Z}_t, \quad \{\mathbf{Z}_t\} \sim \mathrm{WN}(\mathbf{0}, \Sigma), \tag{8.6.4}$$

then postmultiplying by $\mathbf{X}_{t-j}'$, $j = 0, \ldots, p$, and taking expectations gives the equations

$$\Sigma = \Gamma(0) - \sum_{j=1}^{p} \Phi_j\Gamma(-j) \tag{8.6.5}$$

and

$$\Gamma(i) = \sum_{j=1}^{p} \Phi_j\Gamma(i - j), \quad i = 1, \ldots, p. \tag{8.6.6}$$

Given the matrices $\Gamma(0), \ldots, \Gamma(p)$, equations (8.6.6) can be used to determine the coefficient matrices $\Phi_1, \ldots, \Phi_p$. The white noise covariance matrix $\Sigma$ can then be found from (8.6.5). The solution of these equations for $\Phi_1, \ldots, \Phi_p$, and $\Sigma$ is identical to the solution of (8.5.6) and (8.5.7) for the prediction coefficient matrices $\Phi_{p1}, \ldots, \Phi_{pp}$ and the corresponding prediction error covariance matrix $V_p$. Consequently, Whittle's algorithm can be used to carry out the algebra.
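For small $p$ and $m$, equations (8.6.5) and (8.6.6) can also be solved as one block linear system, which gives a quick way to experiment with Yule-Walker estimation. The sketch below is not Whittle's recursive algorithm and not the ITSM implementation. It replaces $\Gamma(h)$ by the sample covariance matrices, solves the block system with `numpy.linalg.solve`, and tries the result on data simulated from a hypothetical VAR(1) model with Gaussian noise. All names are illustrative.

```python
import numpy as np

def sample_gamma(X, h):
    """Sample covariance matrix Gamma_hat(h); rows of X are observations."""
    if h < 0:
        return sample_gamma(X, -h).T
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    return Xc[h:].T @ Xc[:n - h] / n

def yule_walker_var(X, p):
    """Solve (8.6.6) for Phi_1..Phi_p and (8.6.5) for Sigma, using Gamma_hat."""
    m = X.shape[1]
    # Block (j, i) of G is Gamma(i - j), so [Phi_1 ... Phi_p] G = [Gamma(1) ... Gamma(p)]
    G = np.block([[sample_gamma(X, i - j) for i in range(1, p + 1)]
                  for j in range(1, p + 1)])
    R = np.hstack([sample_gamma(X, i) for i in range(1, p + 1)])
    A = np.linalg.solve(G.T, R.T).T
    Phis = [A[:, k * m:(k + 1) * m] for k in range(p)]
    Sigma = sample_gamma(X, 0) - sum(Phis[k] @ sample_gamma(X, -(k + 1))
                                     for k in range(p))
    return Phis, Sigma

# Simulate a hypothetical causal VAR(1) with identity noise covariance
rng = np.random.default_rng(2)
Phi_true = np.array([[0.5, 0.3], [0.2, 0.4]])
n, burn = 5000, 200
X = np.zeros((n + burn, 2))
for t in range(1, n + burn):
    X[t] = Phi_true @ X[t - 1] + rng.standard_normal(2)
X = X[burn:]

Phis_hat, Sigma_hat = yule_walker_var(X, p=1)
print(Phis_hat[0])
print(Sigma_hat)
```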

The Yule-Walker estimators $\hat\Phi_1, \ldots, \hat\Phi_p$, and $\hat\Sigma$ for the model (8.6.4) fitted to the data $\mathbf{X}_1, \ldots, \mathbf{X}_n$ are obtained by replacing $\Gamma(j)$ in (8.6.5) and (8.6.6) by $\hat\Gamma(j)$, $j = 0, \ldots, p$, and solving the resulting equations for $\Phi_1, \ldots, \Phi_p$, and $\Sigma$. The solution of these equations is obtained from ITSM by selecting the multivariate option AR-Model>Estimation>Yule-Walker. The mean vector of the fitted model is the sample mean of the data, and Whittle's algorithm is used to solve the equations (8.6.5) and (8.6.6) for the coefficient matrices and the white noise covariance matrix. The fitted model is displayed by ITSM in the form

$$\mathbf{X}_t = \boldsymbol\phi_0 + \Phi_1\mathbf{X}_{t-1} + \cdots + \Phi_p\mathbf{X}_{t-p} + \mathbf{Z}_t, \quad \{\mathbf{Z}_t\} \sim \mathrm{WN}(\mathbf{0}, \Sigma).$$

Note that the mean $\boldsymbol\mu$ of this model is not the vector $\boldsymbol\phi_0$, but

$$\boldsymbol\mu = (I - \Phi_1 - \cdots - \Phi_p)^{-1}\boldsymbol\phi_0.$$

In fitting multivariate autoregressive models using ITSM, check the box Find minimum AICC model to find the AR($p$) model with $0 \le p \le 20$ that minimizes the AICC value as defined in (8.6.3).

Analogous calculations using Jones's multivariate version of Burg's algorithm can be carried out by selecting AR-Model>Estimation>Burg.

Example 8.6.1. The Dow Jones and All Ordinaries Indices


To find the minimum AICC Yule-Walker model (of order less than or equal to 20) for the bivariate series $\{(X_{t1}, X_{t2})', t = 1, \ldots, 250\}$ of Example 8.1.1, proceed as follows. Select File>Project>Open>Multivariate, click OK, and then double-click on the file name, DJAOPC2.TSM. Check that Number of columns is set to 2, the dimension of the observation vectors, and click OK again to see graphs of the two component time series. No differencing is required (recalling from Example 8.1.1 that $\{X_{t1}\}$ and $\{X_{t2}\}$ are the daily percentage price changes of the original Dow Jones and All Ordinaries Indices). Select AR-Model>Estimation>Yule-Walker, check the box Find minimum AICC Model, click OK, and you will obtain the model

$$\begin{bmatrix} X_{t1} \\ X_{t2} \end{bmatrix} = \begin{bmatrix} 0.0288 \\ 0.00836 \end{bmatrix} + \begin{bmatrix} -0.0148 & 0.0357 \\ 0.6589 & 0.0998 \end{bmatrix} \begin{bmatrix} X_{t-1,1} \\ X_{t-1,2} \end{bmatrix} + \begin{bmatrix} Z_{t1} \\ Z_{t2} \end{bmatrix},$$

where

$$\begin{bmatrix} Z_{t1} \\ Z_{t2} \end{bmatrix} \sim \mathrm{WN}\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 0.3653 & 0.0224 \\ 0.0224 & 0.6016 \end{bmatrix}\right).$$

□ 
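The relation $\boldsymbol\mu = (I - \Phi_1 - \cdots - \Phi_p)^{-1}\boldsymbol\phi_0$ can be checked on the fitted model of this example, whose constant vector is $(0.0288, 0.00836)'$ and whose coefficient matrix has rows $(-0.0148, 0.0357)$ and $(0.6589, 0.0998)$. The short sketch below is an illustration only; the variable names are arbitrary.

```python
import numpy as np

# Fitted VAR(1) of Example 8.6.1
phi0 = np.array([0.0288, 0.00836])
Phi1 = np.array([[-0.0148, 0.0357],
                 [0.6589, 0.0998]])

# Mean of the model: mu = (I - Phi1)^{-1} phi0
mu = np.linalg.solve(np.eye(2) - Phi1, phi0)

# Causality check: both eigenvalues of Phi1 lie inside the unit circle
spectral_radius = np.max(np.abs(np.linalg.eigvals(Phi1)))
print(mu, spectral_radius)
```

The second component of `mu` agrees, to the accuracy of the rounded coefficients, with the sample mean 0.0309 of the All Ordinaries series quoted in Example 8.6.3, as it should since ITSM takes the mean vector of the fitted model to be the sample mean.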


Example 8.6.2. Sales with a leading indicator

The series $\{Y_{t1}\}$ (leading indicator) and $\{Y_{t2}\}$ (sales) are stored in bivariate form ($Y_{t1}$ in column 1 and $Y_{t2}$ in column 2) in the file LS2.TSM. On opening this file in ITSM you will see the graphs of the two component time series. Inspection of the graphs immediately suggests, as in Example 8.2.2, that the differencing operator $\nabla = 1 - B$ should be applied to the data before a stationary AR model is fitted. Select Transform>Difference and specify 1 for the differencing lag. Click OK and you will see the graphs of the two differenced series. Inspection of the series and their correlation functions (obtained by pressing the second yellow button at the top of the ITSM window) suggests that no further differencing is necessary. The next step is to select AR-model>Estimation>Yule-Walker with the option Find minimum AICC model. The resulting model has order $p = 5$ and parameters $\boldsymbol\phi_0 = (0.0328\ \ 0.0156)'$,

$$\hat\Phi_1 = \begin{bmatrix} -0.517 & 0.024 \\ -0.019 & -0.051 \end{bmatrix}, \quad \hat\Phi_2 = \begin{bmatrix} -0.192 & -0.018 \\ 0.047 & 0.250 \end{bmatrix}, \quad \hat\Phi_3 = \begin{bmatrix} -0.073 & 0.010 \\ 4.678 & 0.207 \end{bmatrix},$$

$$\hat\Phi_4 = \begin{bmatrix} -0.032 & -0.009 \\ 3.664 & 0.004 \end{bmatrix}, \quad \hat\Phi_5 = \begin{bmatrix} 0.022 & 0.011 \\ 1.300 & 0.029 \end{bmatrix}, \quad \hat\Sigma = \begin{bmatrix} 0.076 & -0.003 \\ -0.003 & 0.095 \end{bmatrix},$$
with AICC = 109.49. (Analogous calculations using Burg's algorithm give an AR(8) model for the differenced series.) The sample cross-correlations of the residual vectors $\hat{\mathbf{Z}}_t$ can be plotted by clicking on the last blue button at the top of the ITSM window. These are nearly all within the bounds $\pm 1.96/\sqrt{n}$, suggesting that the model is a good fit. The components of the residual vectors themselves are plotted by selecting AR Model>Residual Analysis>Plot Residuals. Simulated observations from the fitted model can be generated using the option AR Model>Simulate. The fitted model has the interesting property that the upper right component of each of the coefficient matrices is close to zero. This suggests that $\{X_{t1}\}$ can be effectively modeled independently of $\{X_{t2}\}$. In fact, the MA(1) model

$$X_{t1} = (1 - 0.474B)U_t, \quad \{U_t\} \sim \mathrm{WN}(0, 0.0779), \tag{8.6.7}$$

provides an adequate fit to the univariate series $\{X_{t1}\}$. Inspecting the bottom rows of the coefficient matrices and deleting small entries, we find that the relation between $\{X_{t1}\}$ and $\{X_{t2}\}$ can be expressed approximately as

$$X_{t2} = 0.250X_{t-2,2} + 0.207X_{t-3,2} + 4.678X_{t-3,1} + 3.664X_{t-4,1} + 1.300X_{t-5,1} + W_t,$$

or equivalently,

$$X_{t2} = \frac{4.678B^3(1 + 0.783B + 0.278B^2)}{1 - 0.250B^2 - 0.207B^3}X_{t1} + \frac{W_t}{1 - 0.250B^2 - 0.207B^3}, \tag{8.6.8}$$

where $\{W_t\} \sim \mathrm{WN}(0, 0.095)$. Moreover, since the estimated noise covariance matrix is essentially diagonal, it follows that the two sequences $\{X_{t1}\}$ and $\{W_t\}$ are uncorrelated. This reduced model defined by (8.6.7) and (8.6.8) is an example of a transfer function model that expresses the "output" series $\{X_{t2}\}$ as the output of a linear filter with "input" $\{X_{t1}\}$ plus added noise. A more direct approach to the fitting of transfer function models is given in Section 11.1 and applied to this same data set.

□ 
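The numerator of the transfer function in (8.6.8) comes from factoring 4.678 out of the three lagged-input coefficients 4.678, 3.664 and 1.300 that appear in the bottom rows of the fitted coefficient matrices. A small arithmetic check, with arbitrary variable names:

```python
# Lag-3, lag-4 and lag-5 coefficients of X_{t1} in the approximate relation
b3, b4, b5 = 4.678, 3.664, 1.300

# Factor b3 out: b3 * B^3 * (1 + ratio1 * B + ratio2 * B^2)
ratio1 = b4 / b3
ratio2 = b5 / b3
print(round(ratio1, 3), round(ratio2, 3))
```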


8.6.2  Forecasting  Multivariate  Autoregressive  Processes 

The technique developed in Section 8.5 allows us to compute the minimum mean squared error one-step linear predictors $\hat{\mathbf{X}}_{n+1}$ for any multivariate stationary time series from the mean $\boldsymbol\mu$ and autocovariance matrices $\Gamma(h)$ by recursively determining the coefficients $\Phi_{ni}$, $i = 1, \ldots, n$, and evaluating

$$\hat{\mathbf{X}}_{n+1} = \boldsymbol\mu + \Phi_{n1}(\mathbf{X}_n - \boldsymbol\mu) + \cdots + \Phi_{nn}(\mathbf{X}_1 - \boldsymbol\mu). \tag{8.6.9}$$

The situation is simplified when $\{\mathbf{X}_t\}$ is the causal AR($p$) process defined by (8.6.4), since for $n \ge p$ (as is almost always the case in practice)

$$\hat{\mathbf{X}}_{n+1} = \Phi_1\mathbf{X}_n + \cdots + \Phi_p\mathbf{X}_{n+1-p}. \tag{8.6.10}$$

To verify (8.6.10) it suffices to observe that the right-hand side has the required form (8.5.2) and that the prediction error

$$\mathbf{X}_{n+1} - \Phi_1\mathbf{X}_n - \cdots - \Phi_p\mathbf{X}_{n+1-p} = \mathbf{Z}_{n+1}$$

is orthogonal to $\mathbf{X}_1, \ldots, \mathbf{X}_n$ in the sense of (8.5.3). (In fact, the prediction error is orthogonal to all $\mathbf{X}_j$, $-\infty < j \le n$, showing that if $n \ge p$, then (8.6.10) is also the best linear predictor of $\mathbf{X}_{n+1}$ in terms of all components of $\mathbf{X}_j$, $-\infty < j \le n$.) The covariance matrix of the one-step prediction error is clearly $E(\mathbf{Z}_{n+1}\mathbf{Z}_{n+1}') = \Sigma$.

To compute the best $h$-step linear predictor $P_n\mathbf{X}_{n+h}$ based on all the components of $\mathbf{X}_1, \ldots, \mathbf{X}_n$ we apply the linear operator $P_n$ to (8.6.4) to obtain the recursions

$$P_n\mathbf{X}_{n+h} = \Phi_1P_n\mathbf{X}_{n+h-1} + \cdots + \Phi_pP_n\mathbf{X}_{n+h-p}. \tag{8.6.11}$$

These equations are easily solved recursively, first for $P_n\mathbf{X}_{n+1}$, then for $P_n\mathbf{X}_{n+2}$, $P_n\mathbf{X}_{n+3}$, ..., etc. If $n \ge p$, then the $h$-step predictors based on all components of $\mathbf{X}_j$, $-\infty < j \le n$, also satisfy (8.6.11) and are therefore the same as the $h$-step predictors based on $\mathbf{X}_1, \ldots, \mathbf{X}_n$.

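To compute the $h$-step error covariance matrices, recall from (8.4.6) that

$$\mathbf{X}_{n+h} = \sum_{j=0}^{\infty} \Psi_j\mathbf{Z}_{n+h-j}, \tag{8.6.12}$$

where the coefficient matrices $\Psi_j$ are found from the recursions (8.4.8) with $q = 0$. From (8.6.12) we find that for $n \ge p$,

$$P_n\mathbf{X}_{n+h} = \sum_{j=h}^{\infty} \Psi_j\mathbf{Z}_{n+h-j}. \tag{8.6.13}$$

Subtracting (8.6.13) from (8.6.12) gives the $h$-step prediction error

$$\mathbf{X}_{n+h} - P_n\mathbf{X}_{n+h} = \sum_{j=0}^{h-1} \Psi_j\mathbf{Z}_{n+h-j}, \tag{8.6.14}$$

with covariance matrix

$$E\big[(\mathbf{X}_{n+h} - P_n\mathbf{X}_{n+h})(\mathbf{X}_{n+h} - P_n\mathbf{X}_{n+h})'\big] = \sum_{j=0}^{h-1} \Psi_j\Sigma\Psi_j', \quad n \ge p. \tag{8.6.15}$$

For the (not necessarily zero-mean) causal AR($p$) process defined by

$$\mathbf{X}_t = \boldsymbol\phi_0 + \Phi_1\mathbf{X}_{t-1} + \cdots + \Phi_p\mathbf{X}_{t-p} + \mathbf{Z}_t, \quad \{\mathbf{Z}_t\} \sim \mathrm{WN}(\mathbf{0}, \Sigma),$$

equations (8.6.10) and (8.6.11) remain valid, provided that $\boldsymbol\phi_0$ is added to each of their right-hand sides. The error covariance matrices are the same as in the case $\boldsymbol\phi_0 = \mathbf{0}$.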

The  above  calculations  are  all  based  on  the  assumption  that  the  AR (p)  model 
for  the  series  is  known.  However,  in  practice,  the  parameters  of  the  model  are  usually 
estimated  from  the  data,  and  the  uncertainty  in  the  predicted  values  of  the  series  will  be 
larger than indicated by (8.6.15) because of parameter estimation errors. See Lütkepohl (1993).
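The forecasting recursions (8.6.11) and the error covariance formula (8.6.15) can be sketched as follows for an AR($p$) model with constant term. The function name is hypothetical, the parameters are those of the fitted model in Example 8.6.1, and the last observation `x_n` is an invented value used only to exercise the code. As the text notes, treating the fitted parameters as known understates the true forecast uncertainty.

```python
import numpy as np

def forecast_var(phi0, Phis, Sigma, history, h_max):
    """h-step predictors via (8.6.11) (with phi0 added) and covariances via (8.6.15).

    history has one observation per row and at least p = len(Phis) rows.
    """
    p, m = len(Phis), len(phi0)
    past = [history[-k] for k in range(1, p + 1)]   # X_n, X_{n-1}, ..., X_{n+1-p}
    preds = []
    for _ in range(h_max):
        x_next = phi0 + sum(Phis[k] @ past[k] for k in range(p))
        preds.append(x_next)
        past = [x_next] + past[:-1]                  # predictors replace future values
    # Psi_j from (8.4.8) with q = 0
    Psi = [np.eye(m)]
    for j in range(1, h_max):
        Psi.append(sum(Phis[k - 1] @ Psi[j - k] for k in range(1, min(j, p) + 1)))
    covs, running = [], np.zeros((m, m))
    for j in range(h_max):
        running = running + Psi[j] @ Sigma @ Psi[j].T
        covs.append(running.copy())
    return preds, covs

# Fitted VAR(1) of Example 8.6.1
phi0 = np.array([0.0288, 0.00836])
Phi1 = np.array([[-0.0148, 0.0357], [0.6589, 0.0998]])
Sigma = np.array([[0.3653, 0.0224], [0.0224, 0.6016]])
x_n = np.array([0.5, -0.2])                          # hypothetical last observation

preds, covs = forecast_var(phi0, [Phi1], Sigma, x_n.reshape(1, 2), h_max=50)
mu = np.linalg.solve(np.eye(2) - Phi1, phi0)         # long-run forecasts approach mu
print(preds[0], covs[0][1, 1], covs[1][1, 1])
```

The one-step error covariance is $\Sigma$ itself, and the error covariances increase with $h$ in the sense that each added term $\Psi_j\Sigma\Psi_j'$ is nonnegative definite.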


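Example 8.6.3. The Dow Jones and All Ordinaries Indices

The VAR(1) model fitted to the series $\{\mathbf{X}_t, t = 1, \ldots, 250\}$ in Example 8.6.1 was

$$\begin{bmatrix} X_{t1} \\ X_{t2} \end{bmatrix} = \begin{bmatrix} 0.0288 \\ 0.00836 \end{bmatrix} + \begin{bmatrix} -0.0148 & 0.0357 \\ 0.6589 & 0.0998 \end{bmatrix} \begin{bmatrix} X_{t-1,1} \\ X_{t-1,2} \end{bmatrix} + \begin{bmatrix} Z_{t1} \\ Z_{t2} \end{bmatrix},$$

where

$$\begin{bmatrix} Z_{t1} \\ Z_{t2} \end{bmatrix} \sim \mathrm{WN}\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 0.3653 & 0.0224 \\ 0.0224 & 0.6016 \end{bmatrix}\right).$$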
The one-step mean squared error for prediction of $X_{t2}$, assuming the validity of this model, is thus 0.6016. This is a substantial reduction from the estimated mean squared error $\hat\gamma_{22}(0) = 0.7712$ when the sample mean $\hat\mu_2 = 0.0309$ is used as the one-step predictor.

If we fit a univariate model to the series $\{X_{t2}\}$ using ITSM, we find that the autoregression with minimum AICC value (645.0) is

$$X_{t2} = 0.0273 + 0.1180X_{t-1,2} + Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, 0.7604).$$

Assuming the validity of this model, we thus obtain a mean squared error for one-step prediction of 0.7604, which is slightly less than the estimated mean squared error (0.7712) incurred when the sample mean is used for one-step prediction.

The preceding calculations suggest that there is little to be gained from the point of view of one-step prediction by fitting a univariate model to $\{X_{t2}\}$, while there is a substantial reduction achieved by the bivariate AR(1) model for $\{\mathbf{X}_t = (X_{t1}, X_{t2})'\}$.

To test the models fitted above, we consider the next forty values $\{\mathbf{X}_t, t = 251, \ldots, 290\}$, which are stored in the file DJAOPCF.TSM. We can use these values, in conjunction with the bivariate and univariate models fitted to the data for $t = 1, \ldots, 250$, to compute one-step predictors of $X_{t2}$, $t = 251, \ldots, 290$. The results are as follows:

Predictor               Average Squared Error
$\hat\mu = 0.0309$      0.4706
AR(1)                   0.4591
VAR(1)                  0.3962

It is clear from these results that the sample variance of the series $\{X_{t2}, t = 251, \ldots, 290\}$ is rather less than that of the series $\{X_{t2}, t = 1, \ldots, 250\}$, and consequently, the average squared errors of all three predictors are substantially less than expected from the models fitted to the latter series. Both the AR(1) and VAR(1) models show an improvement in one-step average squared error over the sample mean $\hat\mu$, but the improvement shown by the bivariate model is much more pronounced.

□ 

The calculation of predictors and their error covariance matrices for multivariate
ARIMA and SARIMA processes is analogous to the corresponding univariate
calculation, so we shall simply state the pertinent results. Suppose that $\{\mathbf{Y}_t\}$ is
a nonstationary process satisfying $D(B)\mathbf{Y}_t = \mathbf{U}_t$, where
$D(z) = 1 - d_1 z - \cdots - d_r z^r$ is a polynomial with $D(1) = 0$ and $\{\mathbf{U}_t\}$ is a causal
invertible ARMA process with mean $\boldsymbol{\mu}$. Then $\mathbf{X}_t = \mathbf{U}_t - \boldsymbol{\mu}$ satisfies

$$\Phi(B)\mathbf{X}_t = \Theta(B)\mathbf{Z}_t, \quad \{\mathbf{Z}_t\} \sim \mathrm{WN}(\mathbf{0}, \Sigma).$$  (8.6.16)

Under the assumption that the random vectors $\mathbf{Y}_{-r+1}, \ldots, \mathbf{Y}_0$ are uncorrelated with
the sequence $\{\mathbf{Z}_t\}$, the best linear predictors $P_n\mathbf{Y}_j$ of $\mathbf{Y}_j$, $j > n > 0$, based on 1 and
the components of $\mathbf{Y}_j$, $-r+1 \le j \le n$, are found as follows. Compute the observed
values of $\mathbf{U}_t = D(B)\mathbf{Y}_t$, $t = 1, \ldots, n$, and use the ARMA model for
$\mathbf{X}_t = \mathbf{U}_t - \boldsymbol{\mu}$ to compute predictors $P_n\mathbf{U}_{n+h}$. Then use the recursions

$$P_n\mathbf{Y}_{n+h} = P_n\mathbf{U}_{n+h} + \sum_{j=1}^{r} d_j P_n\mathbf{Y}_{n+h-j}$$  (8.6.17)
to compute successively $P_n\mathbf{Y}_{n+1}$, $P_n\mathbf{Y}_{n+2}$, $P_n\mathbf{Y}_{n+3}$, etc. The error covariance matrices
are approximately (for large $n$)

$$E\left[(\mathbf{Y}_{n+h} - P_n\mathbf{Y}_{n+h})(\mathbf{Y}_{n+h} - P_n\mathbf{Y}_{n+h})'\right] = \sum_{j=0}^{h-1} \Psi_j^* \Sigma \Psi_j^{*\prime},$$  (8.6.18)

where $\Psi_j^*$ is the coefficient of $z^j$ in the power series expansion

$$\sum_{j=0}^{\infty} \Psi_j^* z^j = D(z)^{-1}\Phi^{-1}(z)\Theta(z), \quad |z| < 1.$$

The matrices $\Psi_j^*$ are most readily found from the recursions (8.4.8) after replacing
$\Phi_j$, $j = 1, \ldots, p$, by $\Phi_j^*$, $j = 1, \ldots, p + r$, where $\Phi_j^*$ is the coefficient of $z^j$ in
$D(z)\Phi(z)$.
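As a rough illustration of how the matrices $\Psi_j^*$ and the sums in (8.6.18) might be computed in the purely autoregressive case, the following Python sketch multiplies out $D(z)\Phi(z)$ and then runs the usual autoregressive recursion for the power series coefficients. The function names, the use of NumPy, the example matrices, and the sign convention $D(z)\Phi(z) = I - \Phi_1^* z - \cdots - \Phi_{p+r}^* z^{p+r}$ are assumptions made here for illustration; none of them is taken from ITSM.

```python
import numpy as np

def combined_ar_coeffs(d, Phi):
    """Return [Phi*_1, ..., Phi*_{p+r}] with D(z) Phi(z) = I - sum_j Phi*_j z^j.

    d   : scalars d_1..d_r from D(z) = 1 - d_1 z - ... - d_r z^r
    Phi : list of (m x m) arrays Phi_1..Phi_p from Phi(z) = I - Phi_1 z - ...
    """
    m = Phi[0].shape[0]
    D = [1.0] + [-dj for dj in d]               # coefficients of D(z)
    P = [np.eye(m)] + [-Pj for Pj in Phi]       # coefficients of Phi(z)
    prod = [np.zeros((m, m)) for _ in range(len(D) + len(P) - 1)]
    for i, di in enumerate(D):                  # polynomial multiplication
        for j, Pj in enumerate(P):
            prod[i + j] += di * Pj
    return [-c for c in prod[1:]]

def psi_star(d, Phi, h):
    """Psi*_0..Psi*_{h-1} for the purely autoregressive case Theta(z) = I."""
    phi_star = combined_ar_coeffs(d, Phi)
    m = Phi[0].shape[0]
    psi = [np.eye(m)]
    for j in range(1, h):
        acc = np.zeros((m, m))
        for k in range(1, min(j, len(phi_star)) + 1):
            acc += phi_star[k - 1] @ psi[j - k]
        psi.append(acc)
    return psi

def error_cov(d, Phi, Sigma, h):
    """h-step error covariance sum_{j<h} Psi*_j Sigma Psi*_j' as in (8.6.18)."""
    return sum(P @ Sigma @ P.T for P in psi_star(d, Phi, h))

# Illustrative (made-up) bivariate AR(1), differenced once: D(z) = 1 - z.
Phi1 = np.array([[0.5, 0.1], [0.0, 0.3]])
Sigma = np.array([[1.0, 0.2], [0.2, 2.0]])
psi = psi_star([1.0], [Phi1], 3)
cov2 = error_cov([1.0], [Phi1], Sigma, 2)
```

For $D(z) = 1 - z$ and an AR(1) model this gives $\Psi_1^* = I + \Phi_1$, so the two-step error covariance is $\Sigma + (I + \Phi_1)\Sigma(I + \Phi_1)'$.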


Remark 6. In the special case where $\Theta(z) = I$ (i.e., in the purely autoregressive
case) the expression (8.6.18) for the $h$-step error covariance matrix is exact for all
$n \ge p$ (i.e., if there are at least $p + r$ observed vectors). The program ITSM allows
differencing transformations and subtraction of the mean before fitting a multivariate
autoregression. Predicted values for the original series and the standard deviations of
the prediction errors can be determined using the multivariate option
Forecasting>AR Model. □


Remark  7.  In  the  multivariate  case,  simple  differencing  of  the  type  discussed  in  this 
section  where  the  same  operator  D(B )  is  applied  to  all  components  of  the  random 
vectors  is  rather  restrictive.  It  is  useful  to  consider  more  general  linear  transformations 
of  the  data  for  the  purpose  of  generating  a  stationary  series.  Such  considerations  lead 
to  the  class  of  cointegrated  models  discussed  briefly  in  Section  8.7  below.  □ 


Sales  with  a  leading  indicator 

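Assume that the model fitted to the bivariate series $\{\mathbf{Y}_t,\ t = 0, \ldots, 149\}$ in
Example 8.6.2 is correct, i.e., that

$$\hat\Phi(B)\mathbf{X}_t = \mathbf{Z}_t, \quad \{\mathbf{Z}_t\} \sim \mathrm{WN}(\mathbf{0}, \hat\Sigma),$$

where

$$\mathbf{X}_t = (1 - B)\mathbf{Y}_t - (0.0228, 0.420)', \quad t = 1, \ldots, 149,$$

$\hat\Phi(B) = I - \hat\Phi_1 B - \cdots - \hat\Phi_5 B^5$, and $\hat\Phi_1, \ldots, \hat\Phi_5$, $\hat\Sigma$ are the matrices found
in Example 8.6.2. Then the one- and two-step predictors of $\mathbf{X}_{150}$ and $\mathbf{X}_{151}$ are obtained
from (8.6.11) as

$$P_{149}\mathbf{X}_{150} = \hat\Phi_1\mathbf{X}_{149} + \cdots + \hat\Phi_5\mathbf{X}_{145} = \begin{bmatrix} 0.163 \\ -0.217 \end{bmatrix}$$

and

$$P_{149}\mathbf{X}_{151} = \hat\Phi_1 P_{149}\mathbf{X}_{150} + \hat\Phi_2\mathbf{X}_{149} + \cdots + \hat\Phi_5\mathbf{X}_{146} = \begin{bmatrix} -0.027 \\ 0.816 \end{bmatrix}$$

with error covariance matrices, from (8.6.15),

$$\hat\Sigma = \begin{bmatrix} 0.076 & -0.003 \\ -0.003 & 0.095 \end{bmatrix}$$

and

$$\hat\Sigma + \hat\Phi_1\hat\Sigma\hat\Phi_1' = \begin{bmatrix} 0.096 & -0.002 \\ -0.002 & 0.095 \end{bmatrix},$$

respectively.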

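Similarly, the one- and two-step predictors of $\mathbf{Y}_{150}$ and $\mathbf{Y}_{151}$ are obtained from
(8.6.17) as

$$P_{149}\mathbf{Y}_{150} = \begin{bmatrix} 0.0228 \\ 0.420 \end{bmatrix} + P_{149}\mathbf{X}_{150} + \mathbf{Y}_{149} = \begin{bmatrix} 13.59 \\ 262.90 \end{bmatrix}$$

and

$$P_{149}\mathbf{Y}_{151} = \begin{bmatrix} 0.0228 \\ 0.420 \end{bmatrix} + P_{149}\mathbf{X}_{151} + P_{149}\mathbf{Y}_{150} = \begin{bmatrix} 13.59 \\ 264.14 \end{bmatrix}$$

with error covariance matrices, from (8.6.18),

$$\hat\Sigma = \begin{bmatrix} 0.076 & -0.003 \\ -0.003 & 0.095 \end{bmatrix}$$

and

$$\hat\Sigma + (I + \hat\Phi_1)\hat\Sigma(I + \hat\Phi_1)' = \begin{bmatrix} 0.094 & -0.003 \\ -0.003 & 0.181 \end{bmatrix},$$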


respectively.  The  predicted  values  and  the  standard  deviations  of  the  predictors  can 
easily  be  verified  with  the  aid  of  the  program  ITSM.  It  is  also  of  interest  to  compare  the 
results  with  those  obtained  by  fitting  a  transfer  function  model  to  the  data  as  described 
in  Section  11.1  below. 

□ 
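The arithmetic linking the predictors of the differenced series to those of the original series can be checked in a few lines. The sketch below uses only the rounded values quoted in the example; the variable names are chosen here for illustration, and small discrepancies in the last digit are to be expected because the inputs are rounded.

```python
# Values quoted in the example above (rounded as printed).
mean = (0.0228, 0.420)          # mean subtracted after differencing
p_x150 = (0.163, -0.217)        # P_149 X_150
p_x151 = (-0.027, 0.816)        # P_149 X_151
p_y150 = (13.59, 262.90)        # P_149 Y_150

# (8.6.17) with D(B) = 1 - B: P_n Y_{n+h} = mean + P_n X_{n+h} + P_n Y_{n+h-1}.
p_y151 = tuple(m + x + y for m, x, y in zip(mean, p_x151, p_y150))

# Inverting the one-step relation recovers the last observation Y_149.
y149 = tuple(y - m - x for y, m, x in zip(p_y150, mean, p_x150))

print(p_y151)   # close to (13.59, 264.14), up to rounding of the inputs
print(y149)
```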


8.7  Cointegration 


We have seen that nonstationary univariate time series can frequently be made
stationary by applying the differencing operator $\nabla = 1 - B$ repeatedly. If $\{\nabla^d X_t\}$ is
stationary for some positive integer $d$ but $\{\nabla^{d-1} X_t\}$ is nonstationary, we say that $\{X_t\}$
is integrated of order $d$, or more concisely, $\{X_t\} \sim I(d)$. Many macroeconomic
time series are found to be integrated of order 1.

If $\{\mathbf{X}_t\}$ is a $k$-variate time series, we define $\{\nabla^d \mathbf{X}_t\}$ to be the series whose $j$th
component is obtained by applying the operator $(1 - B)^d$ to the $j$th component of $\{\mathbf{X}_t\}$,
$j = 1, \ldots, k$. The idea of a cointegrated multivariate time series was introduced by
Granger (1981) and developed by Engle and Granger (1987). Here we use the slightly
different definition of Lütkepohl (1993). We say that the $k$-dimensional time series $\{\mathbf{X}_t\}$
is integrated of order $d$ (or $\{\mathbf{X}_t\} \sim I(d)$) if $d$ is a positive integer, $\{\nabla^d \mathbf{X}_t\}$ is stationary,
and $\{\nabla^{d-1} \mathbf{X}_t\}$ is nonstationary. The $I(d)$ process $\{\mathbf{X}_t\}$ is said to be cointegrated with
cointegration vector $\boldsymbol{\alpha}$ if $\boldsymbol{\alpha}$ is a $k \times 1$ vector such that $\{\boldsymbol{\alpha}'\mathbf{X}_t\}$ is of order less than $d$.

Example 8.7.1  A simple example is provided by the bivariate process whose first component is the
random walk

$$X_t = \sum_{j=1}^{t} Z_j, \quad t = 1, 2, \ldots, \quad \{Z_t\} \sim \mathrm{IID}(0, \sigma^2),$$

and whose second component consists of noisy observations of the same random walk,

$$Y_t = X_t + W_t, \quad t = 1, 2, \ldots, \quad \{W_t\} \sim \mathrm{IID}(0, \tau^2),$$

where $\{W_t\}$ is independent of $\{Z_t\}$. Then $\{(X_t, Y_t)'\}$ is integrated of order 1 and
cointegrated with cointegration vector $\boldsymbol{\alpha} = (1, -1)'$.

The notion of cointegration captures the idea of univariate nonstationary time
series “moving together.” Thus, even though $\{X_t\}$ and $\{Y_t\}$ in Example 8.7.1 are both
nonstationary, they are linked in the sense that they differ only by the stationary
sequence $\{W_t\}$. Series that behave in a cointegrated manner are often encountered in
economics. Engle and Granger (1991) give as an illustrative example the prices of
tomatoes $U_t$ and $V_t$ in Northern and Southern California. These are linked by the fact
that if one were to increase sufficiently relative to the other, the profitability of buying
in one market and selling for a profit in the other would tend to push the prices $(U_t, V_t)'$
toward the straight line $v = u$ in $\mathbb{R}^2$. This line is said to be an attractor for $(U_t, V_t)'$,
since although $U_t$ and $V_t$ may both vary in a nonstationary manner as $t$ increases, the
points $(U_t, V_t)'$ will exhibit relatively small random deviations from the line $v = u$.

□ 
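A small simulation makes the cointegrating relation in Example 8.7.1 concrete. The sketch below is only illustrative: the function name, the parameter values, and the choice of Gaussian noise are assumptions made here (the example itself requires only IID noise with the stated variances).

```python
import random

def simulate_cointegrated(n, sigma=1.0, tau=0.5, seed=0):
    """Simulate X_t = Z_1 + ... + Z_t and Y_t = X_t + W_t for t = 1..n."""
    rng = random.Random(seed)
    x, xs, ys, ws = 0.0, [], [], []
    for _ in range(n):
        x += rng.gauss(0.0, sigma)      # random-walk increment Z_t
        w = rng.gauss(0.0, tau)         # observation noise W_t
        xs.append(x)
        ys.append(x + w)
        ws.append(w)
    return xs, ys, ws

xs, ys, ws = simulate_cointegrated(2000)
# With alpha = (1, -1)': alpha'(X_t, Y_t)' = X_t - Y_t = -W_t, a stationary sequence,
# even though both components wander like a random walk.
combo = [x - y for x, y in zip(xs, ys)]
```

Plotting `xs` and `ys` against `combo` shows two paths that drift together while their difference stays in a band whose width is governed by `tau`.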

If we apply the operator $\nabla = 1 - B$ to the bivariate process defined in Example 8.7.1
in order to render it stationary, we obtain the series $(U_t, V_t)'$, where

$$U_t = Z_t$$

and

$$V_t = Z_t + W_t - W_{t-1}.$$

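The series $\{(U_t, V_t)'\}$ is clearly a stationary multivariate MA(1) process

$$\begin{bmatrix} U_t \\ V_t \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} Z_t \\ Z_t + W_t \end{bmatrix} - \begin{bmatrix} 0 & 0 \\ -1 & 1 \end{bmatrix} \begin{bmatrix} Z_{t-1} \\ Z_{t-1} + W_{t-1} \end{bmatrix}.$$

However, the process $\{(U_t, V_t)'\}$ cannot be represented as an AR($\infty$) process, since
the matrix $\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} - z\begin{bmatrix} 0 & 0 \\ -1 & 1 \end{bmatrix}$ has zero determinant when $z = 1$, thus violating condition
(8.4.10). Care is therefore needed in the estimation of parameters for such models (and
the closely related error-correction models). We shall not go into the details here but
refer the reader to Engle and Granger (1987) and Lütkepohl (1993).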

□ 
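The determinant claim can be checked directly. Write the MA(1) representation of $(U_t, V_t)'$ as $\mathbf{Z}_t^* + \Theta\mathbf{Z}_{t-1}^*$ with $\mathbf{Z}_t^* = (Z_t, Z_t + W_t)'$; the symbols $\mathbf{Z}_t^*$ and $\Theta$ are introduced here only for this check. Then

```latex
\Theta = \begin{bmatrix} 0 & 0 \\ 1 & -1 \end{bmatrix},
\qquad
\det(I + \Theta z)
  = \det\begin{bmatrix} 1 & 0 \\ z & 1 - z \end{bmatrix}
  = 1 - z ,
```

which vanishes at $z = 1$. The second row reproduces $V_t = Z_t + W_t - W_{t-1}$, since $(Z_t + W_t) + Z_{t-1} - (Z_{t-1} + W_{t-1}) = Z_t + W_t - W_{t-1}$.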


Problems

8.1 Let $\{Y_t\}$ be a stationary process and define the bivariate process $X_{t1} = Y_t$,
$X_{t2} = Y_{t-d}$, where $d \ne 0$. Show that $\{(X_{t1}, X_{t2})'\}$ is stationary and express its
cross-correlation function in terms of the autocorrelation function of $\{Y_t\}$. If $\rho_Y(h) \to 0$
as $h \to \infty$, show that there exists a lag $k$ for which $\rho_{12}(k) > \rho_{12}(0)$.

8.2  Show  that  the  covariance  matrix  function  of  the  multivariate  linear  process  defined 
by  (8.2.12)  is  as  specified  in  (8.2.13). 

8.3 Let $\{\mathbf{X}_t\}$ be the bivariate time series whose components are the MA(1) processes
defined by

$$X_{t1} = Z_{t1} + 0.8 Z_{t-1,1}, \quad \{Z_{t1}\} \sim \mathrm{IID}(0, \sigma_1^2),$$

and

$$X_{t2} = Z_{t2} - 0.6 Z_{t-1,2}, \quad \{Z_{t2}\} \sim \mathrm{IID}(0, \sigma_2^2),$$

where the two sequences $\{Z_{t1}\}$ and $\{Z_{t2}\}$ are independent.

a. Find a large-sample approximation to the variance of $n^{1/2}\hat\rho_{12}(h)$.

b. Find a large-sample approximation to the covariance of $n^{1/2}\hat\rho_{12}(h)$ and
$n^{1/2}\hat\rho_{12}(k)$ for $h \ne k$.


8.4 Use the characterization (8.5.3) of the multivariate best linear predictor of $\mathbf{Y}$ in
terms of $\{\mathbf{X}_1, \ldots, \mathbf{X}_n\}$ to establish the orthogonality of the one-step prediction
errors $\mathbf{X}_j - \hat{\mathbf{X}}_j$ and $\mathbf{X}_k - \hat{\mathbf{X}}_k$, $j \ne k$, as asserted in (8.6.1).

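8.5 Determine the covariance matrix function of the ARMA(1,1) process satisfying

$$\mathbf{X}_t - \Phi\mathbf{X}_{t-1} = \mathbf{Z}_t + \Theta\mathbf{Z}_{t-1}, \quad \{\mathbf{Z}_t\} \sim \mathrm{WN}(\mathbf{0}, I_2),$$

where $I_2$ is the $2 \times 2$ identity matrix and $\Phi = \Theta' = \begin{bmatrix} 0.5 & 0.5 \\ 0 & 0.5 \end{bmatrix}$.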

8.6 a. Let $\{\mathbf{X}_t\}$ be a causal AR($p$) process satisfying the recursions

$$\mathbf{X}_t = \Phi_1\mathbf{X}_{t-1} + \cdots + \Phi_p\mathbf{X}_{t-p} + \mathbf{Z}_t, \quad \{\mathbf{Z}_t\} \sim \mathrm{WN}(\mathbf{0}, \Sigma).$$

For $n \ge p$ write down recursions for the predictors $P_n\mathbf{X}_{n+h}$, $h \ge 0$, and
find explicit expressions for the error covariance matrices in terms of the AR
coefficients and $\Sigma$ when $h = 1$, 2, and 3.
b. Suppose now that $\{\mathbf{Y}_t\}$ is the multivariate ARIMA($p$, 1, 0) process satisfying
$\nabla\mathbf{Y}_t = \mathbf{X}_t$, where $\{\mathbf{X}_t\}$ is the AR process in (a). Assuming that $E(\mathbf{Y}_0\mathbf{X}_t') = 0$
for $t \ge 1$, show (using (8.6.17) with $r = 1$ and $d_1 = 1$) that

$$P_n(\mathbf{Y}_{n+h}) = \mathbf{Y}_n + \sum_{j=1}^{h} P_n\mathbf{X}_{n+j}$$

and derive the error covariance matrices when $h = 1$, 2, and 3. Compare these
results with those obtained in Example 8.6.4.


8.7 Use the program ITSM to find the minimum AICC AR model of order less
than or equal to 20 for the bivariate series $\{(X_{t1}, X_{t2})',\ t = 1, \ldots, 200\}$ with
components filed as APPJK2.TSM. Use the fitted model to predict $(X_{t1}, X_{t2})'$,
$t = 201, 202, 203$, and estimate the error covariance matrices of the predictors
(assuming that the fitted model is appropriate for the data).

8.8 Let $\{X_{t1},\ t = 1, \ldots, 63\}$ and $\{X_{t2},\ t = 1, \ldots, 63\}$ denote the differenced series
$\{\nabla \ln Y_{t1}\}$ and $\{\nabla \ln Y_{t2}\}$, where $\{Y_{t1}\}$ and $\{Y_{t2}\}$ are the annual mink and muskrat
trappings filed as APPH.TSM and APPI.TSM, respectively.

a. Use ITSM to construct and save the series $\{X_{t1}\}$ and $\{X_{t2}\}$ as univariate
data files X1.TSM and X2.TSM, respectively. (After making the required
transformations press the red EXP button and save each transformed series to
a file with the appropriate name.) To enter X1 and X2 as a bivariate series in
ITSM, open X1 as a multivariate series with Number of columns equal
to 1. Then open X2 as a univariate series. Click the project editor button (at
the top left of the ITSM window), click on the plus signs next to the projects
X1.TSM and X2.TSM, then click on the series that appears just below X2.TSM
and drag it to the first line of the project X1.TSM. It will then be added as a
second component, making X1.TSM a bivariate project consisting of the two
component series X1 and X2. Click OK to close the project editor and close
the ITSM window labeled X2.TSM. You will then see the graphs of X1 and
X2. Press the second yellow button to see the correlation functions of $\{X_{t1}\}$ and
$\{X_{t2}\}$. For more information on the project editor in ITSM consult the Project
Editor section of the PDF file ITSM_HELP.
b. Conduct a test for independence of the two series $\{X_{t1}\}$ and $\{X_{t2}\}$.

8.9  Use  ITSM  to  open  the  data  file  STOCK7.TSM,  which  contains  the  daily  returns 
on  seven  different  stock  market  indices  from  April  27th,  1998,  through  April 
9th,  1999.  (Consult  the  Data  Sets  section  of  the  PDF  file  ITSM_HELP  for  more 
information.)  Fit  a  multivariate  autoregression  to  the  trivariate  series  consisting 
of  the  returns  on  the  Dow  Jones  Industrials,  All  Ordinaries,  and  Nikkei  indices. 
Check  the  model  for  goodness  of  fit  and  interpret  the  results. 


9.1  State-Space  Representations 

9.2  The  Basic  Structural  Model 

9.3  State-Space  Representation  of  ARIMA  Models 

9.4  The  Kalman  Recursions 

9.5  Estimation  for  State-Space  Models 

9.6  State-Space  Models  with  Missing  Observations 

9.7  The  EM  Algorithm 

9.8  Generalized  State-Space  Models 


In  recent  years  state-space  representations  and  the  associated  Kalman  recursions 
have  had  a  profound  impact  on  time  series  analysis  and  many  related  areas.  The 
techniques  were  originally  developed  in  connection  with  the  control  of  linear  systems 
(for  accounts  of  this  subject  see  Davis  and  Vinter  1985;  Hannan  and  Deistler  1988). 
An  extremely  rich  class  of  models  for  time  series,  including  and  going  well  beyond 
the  linear  ARIMA  and  classical  decomposition  models  considered  so  far  in  this  book, 
can  be  formulated  as  special  cases  of  the  general  state-space  model  defined  below  in 
Section  9.1.  In  econometrics  the  structural  time  series  models  developed  by  Harvey 
(1990)  are  formulated  (like  the  classical  decomposition  model)  directly  in  terms  of 
components  of  interest  such  as  trend,  seasonal  component,  and  noise.  However,  the 
rigidity  of  the  classical  decomposition  model  is  avoided  by  allowing  the  trend  and 
seasonal  components  to  evolve  randomly  rather  than  deterministically.  An  introduction 
to  these  structural  models  is  given  in  Section  9.2,  and  a  state-space  representation  is 
developed  for  a  general  ARIMA  process  in  Section  9.3.  The  Kalman  recursions,  which 
play  a  key  role  in  the  analysis  of  state-space  models,  are  derived  in  Section  9.4.  These 
recursions  allow  a  unified  approach  to  prediction  and  estimation  for  all  processes 
that  can  be  given  a  state-space  representation.  Following  the  development  of  the 
Kalman  recursions  we  discuss  estimation  with  structural  models  (Section  9.5)  and 
the  formulation  of  state-space  models  to  deal  with  missing  values  (Section  9.6).  In 
Section  9.7  we  introduce  the  EM  algorithm,  an  iterative  procedure  for  maximizing  the 
likelihood when only a subset of the complete data set is available. The EM algorithm
is particularly well suited for estimation problems in the state-space framework.
Generalized state-space models are introduced in Section 9.8. These are Bayesian models
that can be used to represent time series of many different types, as demonstrated by
two applications to time series of count data. Throughout the chapter we shall use the
notation

$$\{\mathbf{W}_t\} \sim \mathrm{WN}(\mathbf{0}, \{R_t\})$$

to indicate that the random vectors $\mathbf{W}_t$ have mean $\mathbf{0}$ and that

$$E(\mathbf{W}_s\mathbf{W}_t') = \begin{cases} R_t, & \text{if } s = t, \\ 0, & \text{otherwise.} \end{cases}$$


9.1  State-Space  Representations 

A state-space model for a (possibly multivariate) time series $\{\mathbf{Y}_t,\ t = 1, 2, \ldots\}$
consists of two equations. The first, known as the observation equation, expresses
the $w$-dimensional observation $\mathbf{Y}_t$ as a linear function of a $v$-dimensional state variable
$\mathbf{X}_t$ plus noise. Thus

$$\mathbf{Y}_t = G_t\mathbf{X}_t + \mathbf{W}_t, \quad t = 1, 2, \ldots,$$  (9.1.1)

where $\{\mathbf{W}_t\} \sim \mathrm{WN}(\mathbf{0}, \{R_t\})$ and $\{G_t\}$ is a sequence of $w \times v$ matrices. The second
equation, called the state equation, determines the state $\mathbf{X}_{t+1}$ at time $t + 1$ in terms of
the previous state $\mathbf{X}_t$ and a noise term. The state equation is

$$\mathbf{X}_{t+1} = F_t\mathbf{X}_t + \mathbf{V}_t, \quad t = 1, 2, \ldots,$$  (9.1.2)

where $\{F_t\}$ is a sequence of $v \times v$ matrices, $\{\mathbf{V}_t\} \sim \mathrm{WN}(\mathbf{0}, \{Q_t\})$, and $\{\mathbf{V}_t\}$ is
uncorrelated with $\{\mathbf{W}_t\}$ (i.e., $E(\mathbf{W}_t\mathbf{V}_s') = 0$ for all $s$ and $t$). To complete the
specification, it is assumed that the initial state $\mathbf{X}_1$ is uncorrelated with all of the noise
terms $\{\mathbf{V}_t\}$ and $\{\mathbf{W}_t\}$.
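To make the two equations concrete, the following sketch simulates a realization from (9.1.1) and (9.1.2) in the time-invariant case. It is a minimal illustration rather than a reference implementation: the function name, the use of NumPy, the Gaussian noise, and the example matrices are all assumptions made here, whereas the definition above requires only white noise with the stated covariances.

```python
import numpy as np

def simulate_state_space(F, G, Q, R, x1, n, seed=0):
    """Simulate Y_1..Y_n from Y_t = G X_t + W_t and X_{t+1} = F X_t + V_t.

    Time-invariant F, G, Q, R and Gaussian noise are assumed here.
    Returns the states (n x v) and the observations (n x w).
    """
    rng = np.random.default_rng(seed)
    v, w = F.shape[0], G.shape[0]
    x = np.asarray(x1, dtype=float)
    states, obs = [], []
    for _ in range(n):
        W = rng.multivariate_normal(np.zeros(w), R)   # observation noise W_t
        V = rng.multivariate_normal(np.zeros(v), Q)   # state noise V_t
        states.append(x)
        obs.append(G @ x + W)                         # observation equation (9.1.1)
        x = F @ x + V                                 # state equation (9.1.2)
    return np.array(states), np.array(obs)

# Illustrative (made-up) model with v = 2 state components and w = 1 observation.
F = np.array([[0.8, 0.1], [0.0, 0.5]])
G = np.array([[1.0, 0.0]])
Q = np.diag([0.1, 0.01])
R = np.array([[0.5]])
states, obs = simulate_state_space(F, G, Q, R, x1=[0.0, 0.0], n=50)
```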

Remark 1. A more general form of the state-space model allows for correlation
between $\mathbf{V}_t$ and $\mathbf{W}_t$ (see Brockwell and Davis (1991), Chapter 12) and for the addition
of a control term $H_t\mathbf{u}_t$ in the state equation. In control theory, $H_t\mathbf{u}_t$ represents the effect
of applying a “control” $\mathbf{u}_t$ at time $t$ for the purpose of influencing $\mathbf{X}_{t+1}$. However, the
system defined by (9.1.1) and (9.1.2) with $E(\mathbf{W}_t\mathbf{V}_s') = 0$ for all $s$ and $t$ will be adequate
for our purposes. □

Remark  2.  In  many  important  special  cases,  the  matrices  Ft,  Gt ,  Qt ,  and  Rt  will 
be  independent  of  t ,  in  which  case  the  subscripts  will  be  suppressed.  □ 

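Remark 3. It follows from the observation equation (9.1.1) and the state equation
(9.1.2) that $\mathbf{X}_t$ and $\mathbf{Y}_t$ have the functional forms, for $t = 2, 3, \ldots$,

$$\begin{aligned}
\mathbf{X}_t &= F_{t-1}\mathbf{X}_{t-1} + \mathbf{V}_{t-1} \\
&= F_{t-1}(F_{t-2}\mathbf{X}_{t-2} + \mathbf{V}_{t-2}) + \mathbf{V}_{t-1} \\
&\;\;\vdots \\
&= (F_{t-1} \cdots F_1)\mathbf{X}_1 + (F_{t-1} \cdots F_2)\mathbf{V}_1 + \cdots + F_{t-1}\mathbf{V}_{t-2} + \mathbf{V}_{t-1} \\
&= f_t(\mathbf{X}_1, \mathbf{V}_1, \ldots, \mathbf{V}_{t-1})
\end{aligned}$$  (9.1.3)

and

$$\mathbf{Y}_t = g_t(\mathbf{X}_1, \mathbf{V}_1, \ldots, \mathbf{V}_{t-1}, \mathbf{W}_t).$$  (9.1.4)  □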

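Remark 4. From Remark 3 and the assumptions on the noise terms, it is clear that

$$E(\mathbf{V}_t\mathbf{X}_s') = 0, \quad E(\mathbf{V}_t\mathbf{Y}_s') = 0, \quad 1 \le s \le t,$$

and

$$E(\mathbf{W}_t\mathbf{X}_s') = 0, \quad 1 \le s \le t, \qquad E(\mathbf{W}_t\mathbf{Y}_s') = 0, \quad 1 \le s < t.$$  □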

A time series $\{\mathbf{Y}_t\}$ has a state-space representation if there exists a state-space
model for $\{\mathbf{Y}_t\}$ as specified by equations (9.1.1) and (9.1.2).


As already indicated, it is possible to find a state-space representation for a large
number of time-series (and other) models. It is clear also from the definition that
neither $\{\mathbf{X}_t\}$ nor $\{\mathbf{Y}_t\}$ is necessarily stationary. The beauty of a state-space representation,
when one can be found, lies in the simple structure of the state equation (9.1.2),
which permits relatively simple analysis of the process $\{\mathbf{X}_t\}$. The behavior of $\{\mathbf{Y}_t\}$
is then easy to determine from that of $\{\mathbf{X}_t\}$ using the observation equation (9.1.1).
If the sequence $\{\mathbf{X}_1, \mathbf{V}_1, \mathbf{V}_2, \ldots\}$ is independent, then $\{\mathbf{X}_t\}$ has the Markov property;
i.e., the distribution of $\mathbf{X}_{t+1}$ given $\mathbf{X}_t, \ldots, \mathbf{X}_1$ is the same as the distribution of $\mathbf{X}_{t+1}$
given $\mathbf{X}_t$. This is a property possessed by many physical systems, provided that we
include sufficiently many components in the specification of the state $\mathbf{X}_t$ (for example,
we may choose the state vector in such a way that $\mathbf{X}_t$ includes components of $\mathbf{X}_{t-1}$ for
each $t$).

An  AR(1)  Process 

Let $\{Y_t\}$ be the causal AR(1) process given by

$$Y_t = \phi Y_{t-1} + Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2).$$  (9.1.5)

In this case, a state-space representation for $\{Y_t\}$ is easy to construct. We can, for
example, define a sequence of state variables $X_t$ by

$$X_{t+1} = \phi X_t + V_t, \quad t = 1, 2, \ldots,$$  (9.1.6)

where $X_1 = Y_1 = \sum_{j=0}^{\infty} \phi^j Z_{1-j}$ and $V_t = Z_{t+1}$. The process $\{Y_t\}$ then satisfies the
observation equation

$$Y_t = X_t,$$

which has the form (9.1.1) with $G_t = 1$ and $W_t = 0$.

□ 


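Example 9.1.2  An ARMA(1,1) Process

Let $\{Y_t\}$ be the causal and invertible ARMA(1,1) process satisfying the equations

$$Y_t = \phi Y_{t-1} + Z_t + \theta Z_{t-1}, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2).$$  (9.1.7)

Although the existence of a state-space representation for $\{Y_t\}$ is not obvious, we can
find one by observing that

$$Y_t = \theta(B)X_t = \begin{bmatrix} \theta & 1 \end{bmatrix} \begin{bmatrix} X_{t-1} \\ X_t \end{bmatrix},$$  (9.1.8)

where $\{X_t\}$ is the causal AR(1) process satisfying

$$\phi(B)X_t = Z_t,$$

or the equivalent equation

$$\begin{bmatrix} X_t \\ X_{t+1} \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ 0 & \phi \end{bmatrix} \begin{bmatrix} X_{t-1} \\ X_t \end{bmatrix} + \begin{bmatrix} 0 \\ Z_{t+1} \end{bmatrix}.$$  (9.1.9)

Noting that $X_t = \sum_{j=0}^{\infty} \phi^j Z_{t-j}$, we see that equations (9.1.8) and (9.1.9) for $t = 1, 2, \ldots$
furnish a state-space representation of $\{Y_t\}$ with

$$\mathbf{X}_t = \begin{bmatrix} X_{t-1} \\ X_t \end{bmatrix} \quad \text{and} \quad \mathbf{X}_1 = \begin{bmatrix} \sum_{j=0}^{\infty} \phi^j Z_{-j} \\ \sum_{j=0}^{\infty} \phi^j Z_{1-j} \end{bmatrix}.$$

The extension of this state-space representation to general ARMA and ARIMA
processes is given in Section 9.3.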

□ 
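The equivalence between the ARMA(1,1) recursion (9.1.7) and the state-space form (9.1.8) and (9.1.9) can be checked numerically. The sketch below is illustrative only: the function names and parameter values are chosen here, Gaussian noise is used for convenience, and the infinite past is replaced by a zero start-up ($Z_t = 0$ for $t \le 0$), which is an assumption of this sketch and not the stationary initialization of the example.

```python
import random

def arma11_direct(phi, theta, z):
    """Y_t = phi*Y_{t-1} + Z_t + theta*Z_{t-1}, started from Y_0 = Z_0 = 0."""
    ys, y_prev, z_prev = [], 0.0, 0.0
    for zt in z:
        y = phi * y_prev + zt + theta * z_prev
        ys.append(y)
        y_prev, z_prev = y, zt
    return ys

def arma11_state_space(phi, theta, z):
    """Same series via the state vector (X_{t-1}, X_t)' of (9.1.8)-(9.1.9)."""
    state = (0.0, z[0])                  # (X_0, X_1) under the zero start-up
    ys = []
    for t in range(len(z)):
        x_prev, x_cur = state
        ys.append(theta * x_prev + x_cur)            # observation equation (9.1.8)
        if t + 1 < len(z):
            state = (x_cur, phi * x_cur + z[t + 1])  # state equation (9.1.9)
    return ys

rng = random.Random(1)
z = [rng.gauss(0.0, 1.0) for _ in range(200)]
direct = arma11_direct(0.6, 0.4, z)
via_state = arma11_state_space(0.6, 0.4, z)
```

The two lists agree up to floating-point rounding, which reflects the identity $Y_t = X_t + \theta X_{t-1}$ with $X_t = \phi X_{t-1} + Z_t$.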

In subsequent sections we shall give examples that illustrate the versatility of
state-space models. (More examples can be found in Aoki 1987; Hannan and Deistler 1988;
Harvey 1990; West and Harrison 1989.) Before considering these, we need a slight
modification of (9.1.1) and (9.1.2), which allows for series in which the time index
runs from $-\infty$ to $\infty$. This is a more natural formulation for many time series models.


9.1.1  State-Space Models with $t \in \{0, \pm 1, \ldots\}$

Consider the observation and state equations

$$\mathbf{Y}_t = G\mathbf{X}_t + \mathbf{W}_t, \quad t = 0, \pm 1, \ldots,$$  (9.1.10)

$$\mathbf{X}_{t+1} = F\mathbf{X}_t + \mathbf{V}_t, \quad t = 0, \pm 1, \ldots,$$  (9.1.11)

where $F$ and $G$ are $v \times v$ and $w \times v$ matrices, respectively, $\{\mathbf{V}_t\} \sim \mathrm{WN}(\mathbf{0}, Q)$,
$\{\mathbf{W}_t\} \sim \mathrm{WN}(\mathbf{0}, R)$, and $E(\mathbf{V}_s\mathbf{W}_t') = 0$ for all $s$ and $t$.

The state equation (9.1.11) is said to be stable if the matrix $F$ has all its eigenvalues
in the interior of the unit circle, or equivalently if $\det(I - Fz) \ne 0$ for all
complex $z$ such that $|z| \le 1$. The matrix $F$ is then also said to be stable.

In the stable case equation (9.1.11) has the unique stationary solution
(Problem 9.1) given by

$$\mathbf{X}_t = \sum_{j=0}^{\infty} F^j \mathbf{V}_{t-j-1}.$$

The corresponding sequence of observations

$$\mathbf{Y}_t = \mathbf{W}_t + \sum_{j=0}^{\infty} G F^j \mathbf{V}_{t-j-1}$$

is also stationary.
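The stability condition and the stationary solution can be explored numerically. In the sketch below, the function names, the truncation length, and the example matrices are assumptions chosen for illustration. The covariance of the stationary state follows from the series representation above, because the $\mathbf{V}_t$ are uncorrelated with common covariance $Q$.

```python
import numpy as np

def is_stable(F):
    """True if every eigenvalue of F lies strictly inside the unit circle."""
    return bool(np.max(np.abs(np.linalg.eigvals(F))) < 1.0)

def stationary_state_cov(F, Q, terms=500):
    """Truncated sum_{j>=0} F^j Q F^j', the covariance of X_t = sum_j F^j V_{t-j-1}."""
    cov = np.zeros_like(Q, dtype=float)
    Fj = np.eye(F.shape[0])
    for _ in range(terms):
        cov += Fj @ Q @ Fj.T
        Fj = Fj @ F
    return cov

F = np.array([[0.5, 0.2], [0.0, 0.3]])   # illustrative stable matrix
Q = np.eye(2)
Pi = stationary_state_cov(F, Q)
```

For a stable $F$ the truncated sum converges quickly, and the result satisfies the fixed-point relation $\Pi = F\Pi F' + Q$ implied by the state equation.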
9.2  The Basic Structural Model

A structural time series model, like the classical decomposition model defined by
(1.5.1),  is  specified  in  terms  of  components  such  as  trend,  seasonality,  and  noise, 
which  are  of  direct  interest  in  themselves.  The  deterministic  nature  of  the  trend 
and  seasonal  components  in  the  classical  decomposition  model,  however,  limits  its 
applicability.  A  natural  way  in  which  to  overcome  this  deficiency  is  to  permit  random 
variation  in  these  components.  This  can  be  very  conveniently  done  in  the  framework 
of  a  state-space  representation,  and  the  resulting  rather  flexible  model  is  called  a 
structural  model.  Estimation  and  forecasting  with  this  model  can  be  encompassed  in 
the  general  procedure  for  state-space  models  made  possible  by  the  Kalman  recursions 
of  Section  9.4. 

Example  9.2.1  The  Random  Walk  Plus  Noise  Model 

One of the simplest structural models is obtained by adding noise to a random walk.
It is suggested by the nonseasonal classical decomposition model

$$Y_t = M_t + W_t, \quad \text{where } \{W_t\} \sim \mathrm{WN}(0, \sigma_w^2),$$  (9.2.1)

and $M_t = m_t$, the deterministic “level” or “signal” at time $t$. We now introduce
randomness into the level by supposing that $M_t$ is a random walk satisfying

$$M_{t+1} = M_t + V_t, \quad \text{and } \{V_t\} \sim \mathrm{WN}(0, \sigma_v^2),$$  (9.2.2)

with initial value $M_1 = m_1$. Equations (9.2.1) and (9.2.2) constitute the “local level” or
“random walk plus noise” model. Figure 9-1 shows a realization of length 100 of this
model with $M_1 = 0$, $\sigma_v^2 = 4$, and $\sigma_w^2 = 8$. (The realized values $m_t$ of $M_t$ are plotted as
a solid line, and the observed data are plotted as square boxes.) The differenced data

$$D_t := \nabla Y_t = Y_t - Y_{t-1} = V_{t-1} + W_t - W_{t-1}, \quad t \ge 2,$$

constitute a stationary time series with mean 0 and ACF

$$\rho_D(h) = \begin{cases} \dfrac{-\sigma_w^2}{2\sigma_w^2 + \sigma_v^2}, & \text{if } |h| = 1, \\ 0, & \text{if } |h| > 1. \end{cases}$$


Figure 9-1  Realization from a random walk plus noise model. The random walk is
represented by the solid line and the data are represented by boxes.

Figure 9-2  Sample ACF of the series obtained by differencing the data in Figure 9-1
(horizontal axis: Lag).


Since $\{D_t\}$ is 1-correlated, we conclude from Proposition 2.1.1 that $\{D_t\}$ is an MA(1)
process and hence that $\{Y_t\}$ is an ARIMA(0,1,1) process. More specifically,

$$D_t = Z_t + \theta Z_{t-1}, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2),$$  (9.2.3)

where $\theta$ and $\sigma^2$ are found by solving the equations

$$\frac{\theta}{1 + \theta^2} = \frac{-\sigma_w^2}{2\sigma_w^2 + \sigma_v^2} \quad \text{and} \quad \theta\sigma^2 = -\sigma_w^2.$$

For the process $\{Y_t\}$ generating the data in Figure 9-1, the parameters $\theta$ and $\sigma^2$ of
the differenced series $\{D_t\}$ satisfy $\theta/(1 + \theta^2) = -0.4$ and $\theta\sigma^2 = -8$. Solving these
equations for $\theta$ and $\sigma^2$, we find that $\theta = -0.5$ and $\sigma^2 = 16$ (or $\theta = -2$ and $\sigma^2 = 4$).
The sample ACF of the observed differences $D_t$ of the realization of $\{Y_t\}$ in Figure 9-1
is shown in Figure 9-2.
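The two equations for $\theta$ and $\sigma^2$ reduce to a quadratic in $\theta$, which the following sketch solves for the root with $|\theta| \le 1$. The function name is an assumption made here for illustration, and the sketch requires $\sigma_w^2 > 0$ so that the lag-one autocorrelation is nonzero.

```python
import math

def local_level_to_ma1(sigma_v2, sigma_w2):
    """Return (theta, sigma2) of the MA(1) model for the differenced series.

    Solves theta/(1 + theta^2) = rho and theta*sigma2 = -sigma_w2, where
    rho = -sigma_w2 / (2*sigma_w2 + sigma_v2); requires sigma_w2 > 0.
    """
    rho = -sigma_w2 / (2.0 * sigma_w2 + sigma_v2)
    # rho*theta^2 - theta + rho = 0; this root satisfies |theta| <= 1.
    theta = (1.0 - math.sqrt(1.0 - 4.0 * rho * rho)) / (2.0 * rho)
    sigma2 = -sigma_w2 / theta
    return theta, sigma2

theta, sigma2 = local_level_to_ma1(sigma_v2=4.0, sigma_w2=8.0)
print(theta, sigma2)   # approximately -0.5 and 16
```

The other root of the quadratic is the reciprocal $1/\theta$, which gives the alternative solution $\theta = -2$, $\sigma^2 = 4$ for these parameter values.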

The local level model is often used to represent a measured characteristic of the
output of an industrial process for which the unobserved process level $\{M_t\}$ is intended
to be within specified limits (to meet the design specifications of the manufactured
product). To decide whether or not the process requires corrective attention, it is
important to be able to test the hypothesis that the process level $\{M_t\}$ is constant. From
the state equation, we see that $\{M_t\}$ is constant (and equal to $m_1$) when $V_t = 0$ or
equivalently when $\sigma_v^2 = 0$. This in turn is equivalent to the moving-average model
(9.2.3) for $\{D_t\}$ being noninvertible with $\theta = -1$ (see Problem 8.2). Tests of the unit
root hypothesis $\theta = -1$ were discussed in Section 6.3.2.

□ 

The local level model can easily be extended to incorporate a locally linear trend
with slope $\beta_t$ at time $t$. Equation (9.2.2) is replaced by

$$M_t = M_{t-1} + B_{t-1} + V_{t-1},$$  (9.2.4)

where $B_{t-1} = \beta_{t-1}$. Now if we introduce randomness into the slope by replacing it
with the random walk

$$B_t = B_{t-1} + U_{t-1}, \quad \text{where } \{U_t\} \sim \mathrm{WN}(0, \sigma_u^2),$$  (9.2.5)

we obtain the “local linear trend” model.

To express the local linear trend model in state-space form we introduce the state
vector

$$\mathbf{X}_t = (M_t, B_t)'.$$

Then (9.2.4) and (9.2.5) can be written in the equivalent form

$$\mathbf{X}_{t+1} = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} \mathbf{X}_t + \mathbf{V}_t,$$  (9.2.6)

where $\mathbf{V}_t = (V_t, U_t)'$. The process $\{Y_t\}$ is then determined by the observation equation

$$Y_t = \begin{bmatrix} 1 & 0 \end{bmatrix} \mathbf{X}_t + W_t.$$  (9.2.7)

If $\{\mathbf{X}_1, U_1, V_1, W_1, U_2, V_2, W_2, \ldots\}$ is an uncorrelated sequence, then equations (9.2.6)
and (9.2.7) constitute a state-space representation of the process $\{Y_t\}$, which is a model
for data with randomly varying trend and added noise. For this model we have $v = 2$,
$w = 1$,

$$F = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}, \quad G = \begin{bmatrix} 1 & 0 \end{bmatrix}, \quad Q = \begin{bmatrix} \sigma_v^2 & 0 \\ 0 & \sigma_u^2 \end{bmatrix}, \quad \text{and} \quad R = \sigma_w^2.$$
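A realization of the local linear trend model can be generated directly from (9.2.6) and (9.2.7). The sketch below is illustrative: the function name, its arguments, and the use of Gaussian noise are assumptions made here, since the model itself specifies only white noise with the given variances.

```python
import random

def simulate_local_linear_trend(n, sigma_v, sigma_u, sigma_w, m1=0.0, b1=0.0, seed=0):
    """Simulate Y_1..Y_n with state X_t = (M_t, B_t)' as in (9.2.6)-(9.2.7)."""
    rng = random.Random(seed)
    level, slope, ys = m1, b1, []
    for _ in range(n):
        ys.append(level + rng.gauss(0.0, sigma_w))                # Y_t = [1 0] X_t + W_t
        level, slope = (level + slope + rng.gauss(0.0, sigma_v),  # M_{t+1} = M_t + B_t + V_t
                        slope + rng.gauss(0.0, sigma_u))          # B_{t+1} = B_t + U_t
    return ys

ys = simulate_local_linear_trend(100, sigma_v=1.0, sigma_u=0.1, sigma_w=2.0)
```

Setting all three standard deviations to zero recovers a deterministic straight line with intercept `m1` and slope `b1`, which is the classical linear trend that the model generalizes.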


A  Seasonal  Series  with  Noise 


The classical decomposition (1.5.11) expressed the time series $\{X_t\}$ as a sum of trend,
seasonal, and noise components. The seasonal component (with period $d$) was a
sequence $\{s_t\}$ with the properties $s_{t+d} = s_t$ and $\sum_{t=1}^{d} s_t = 0$. Such a sequence can
be generated, for any values of $s_1, s_0, \ldots, s_{-d+3}$, by means of the recursions

$$s_{t+1} = -s_t - \cdots - s_{t-d+2}, \quad t = 1, 2, \ldots.$$  (9.2.8)

A somewhat more general seasonal component $\{Y_t\}$, allowing for random deviations
from strict periodicity, is obtained by adding a term $S_t$ to the right side of (9.2.8), where
$\{S_t\}$ is white noise with mean zero. This leads to the recursion relations

$$Y_{t+1} = -Y_t - \cdots - Y_{t-d+2} + S_t, \quad t = 1, 2, \ldots.$$  (9.2.9)

To find a state-space representation for $\{Y_t\}$ we introduce the $(d - 1)$-dimensional state
vector

$$\mathbf{X}_t = (Y_t, Y_{t-1}, \ldots, Y_{t-d+2})'.$$

The series $\{Y_t\}$ is then given by the observation equation

$$Y_t = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \end{bmatrix} \mathbf{X}_t, \quad t = 1, 2, \ldots,$$  (9.2.10)

where $\{\mathbf{X}_t\}$ satisfies the state equation

$$\mathbf{X}_{t+1} = F\mathbf{X}_t + \mathbf{V}_t, \quad t = 1, 2, \ldots,$$  (9.2.11)

$\mathbf{V}_t = (S_t, 0, \ldots, 0)'$, and

$$F = \begin{bmatrix} -1 & -1 & \cdots & -1 & -1 \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{bmatrix}.$$

□ 
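The structure of the seasonal transition matrix is easy to verify in code. The sketch below builds the $(d-1) \times (d-1)$ matrix and iterates the noise-free recursion; the function names and the starting values are illustrative assumptions, not part of the example.

```python
def seasonal_transition(d):
    """(d-1) x (d-1) matrix F of the seasonal state equation for period d >= 2."""
    k = d - 1
    F = [[0.0] * k for _ in range(k)]
    F[0] = [-1.0] * k                 # first row: Y_{t+1} = -Y_t - ... - Y_{t-d+2} (+ S_t)
    for i in range(1, k):
        F[i][i - 1] = 1.0             # remaining rows shift the state down by one
    return F

def step(F, x):
    """One noise-free state update x -> F x."""
    return [sum(f * xi for f, xi in zip(row, x)) for row in F]

# With S_t = 0 the recursion (9.2.8) reproduces a sequence of period d.
d = 4
F = seasonal_transition(d)
x = [3.0, -1.0, 0.5]                  # illustrative (Y_1, Y_0, Y_{-1})'
path = [x[0]]
for _ in range(2 * d):
    x = step(F, x)
    path.append(x[0])
```

In the noise-free case the generated values repeat with period $d$ and any $d$ consecutive values sum to zero, which are the two defining properties of the seasonal component.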


Example  9.2.3  A  Randomly  Varying  Trend  with  Random  Seasonality  and  Noise 
A series with randomly varying trend, random seasonality and noise can be constructed
by adding the two series in Examples 9.2.1 and 9.2.2. (Addition of series with
state-space representations is in fact always possible by means of the following construction.
See Problem 9.9.) We introduce the state vector

$$\mathbf{X}_t = \begin{bmatrix} \mathbf{X}_t^1 \\ \mathbf{X}_t^2 \end{bmatrix},$$  (9.2.12)

where $\mathbf{X}_t^1$ and $\mathbf{X}_t^2$ are the state vectors in (9.2.6) and (9.2.11). We then have the following
representation for $\{Y_t\}$, the sum of the two series whose state-space representations
were given in (9.2.6)-(9.2.7) and (9.2.10)-(9.2.11). The state equation is

$$\mathbf{X}_{t+1} = \begin{bmatrix} F_1 & 0 \\ 0 & F_2 \end{bmatrix} \mathbf{X}_t + \begin{bmatrix} \mathbf{V}_t^1 \\ \mathbf{V}_t^2 \end{bmatrix},$$  (9.2.13)

where $F_1$, $F_2$ are the coefficient matrices and $\{\mathbf{V}_t^1\}$, $\{\mathbf{V}_t^2\}$ are the noise vectors in the
state equations (9.2.6) and (9.2.11), respectively. The observation equation is

$$Y_t = \begin{bmatrix} 1 & 0 & 1 & 0 & \cdots & 0 \end{bmatrix} \mathbf{X}_t + W_t,$$  (9.2.14)

where $\{W_t\}$ is the noise sequence in (9.2.7). If the sequence of random vectors
$\{\mathbf{X}_1, \mathbf{V}_1^1, \mathbf{V}_1^2, W_1, \mathbf{V}_2^1, \mathbf{V}_2^2, W_2, \ldots\}$ is uncorrelated, then equations (9.2.13) and (9.2.14)
constitute a state-space representation for $\{Y_t\}$.


□ 
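The construction in (9.2.13) and (9.2.14) amounts to stacking the two state vectors, placing the two transition matrices on the diagonal, and concatenating the observation rows. A minimal sketch follows; the function name, the use of NumPy, and the period $d = 4$ are assumptions made here for illustration.

```python
import numpy as np

def add_state_space(F1, G1, F2, G2):
    """State-space matrices for the sum of two series with stacked states.

    Returns the block-diagonal F and the concatenated G of (9.2.13)-(9.2.14).
    """
    v1, v2 = F1.shape[0], F2.shape[0]
    F = np.zeros((v1 + v2, v1 + v2))
    F[:v1, :v1] = F1
    F[v1:, v1:] = F2
    G = np.hstack([G1, G2])
    return F, G

F_trend = np.array([[1.0, 1.0], [0.0, 1.0]])          # from (9.2.6)
G_trend = np.array([[1.0, 0.0]])                      # from (9.2.7)
d = 4                                                 # illustrative period
F_seas = np.vstack([-np.ones((1, d - 1)),
                    np.eye(d - 2, d - 1)])            # seasonal transition matrix
G_seas = np.eye(1, d - 1)                             # [1 0 ... 0]
F, G = add_state_space(F_trend, G_trend, F_seas, G_seas)
```

The state noise covariance of the combined model is block diagonal in the same way when the two noise sequences are uncorrelated.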


9.3  State-Space  Representation  of  ARIMA  Models 

We begin by establishing a state-space representation for the causal AR($p$) process and
then build on this example to find representations for the general ARMA and ARIMA
processes.

Example 9.3.1  State-Space Representation of a Causal AR($p$) Process

Consider the AR($p$) process defined by

$$Y_{t+1} = \phi_1 Y_t + \phi_2 Y_{t-1} + \cdots + \phi_p Y_{t-p+1} + Z_{t+1}, \quad t = 0, \pm 1, \ldots,$$  (9.3.1)

where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$, and $\phi(z) := 1 - \phi_1 z - \cdots - \phi_p z^p$ is nonzero for $|z| \le 1$. To
express $\{Y_t\}$ in state-space form we simply introduce the state vectors

$$\mathbf{X}_t = \begin{bmatrix} Y_{t-p+1} \\ Y_{t-p+2} \\ \vdots \\ Y_t \end{bmatrix}, \quad t = 0, \pm 1, \ldots.$$  (9.3.2)

From (9.3.1) and (9.3.2) the observation equation is

$$Y_t = \begin{bmatrix} 0 & 0 & 0 & \cdots & 1 \end{bmatrix} \mathbf{X}_t, \quad t = 0, \pm 1, \ldots,$$  (9.3.3)
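while the state equation is given by

$$\mathbf{X}_{t+1} = \begin{bmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \\ \phi_p & \phi_{p-1} & \phi_{p-2} & \cdots & \phi_1 \end{bmatrix} \mathbf{X}_t + \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \\ 1 \end{bmatrix} Z_{t+1}, \quad t = 0, \pm 1, \ldots.$$  (9.3.4)

These equations have the required forms (9.1.10) and (9.1.11) with $\mathbf{W}_t = 0$ and
$\mathbf{V}_t = (0, 0, \ldots, Z_{t+1})'$, $t = 0, \pm 1, \ldots$.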


□ 


Remark 1. In Example 9.3.1 the causality condition $\phi(z) \ne 0$ for $|z| \le 1$ is equivalent to the condition that the state equation (9.3.4) is stable, since the eigenvalues of the coefficient matrix in (9.3.4) are simply the reciprocals of the zeros of $\phi(z)$ (Problem 9.3). □
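
As a rough illustration of Example 9.3.1, the following Python sketch builds the coefficient matrix of (9.3.4) and checks numerically that iterating the state equation reproduces the AR(p) recursion (9.3.1). The function names `ar_state_space` and `step`, the AR(2) coefficients, and the starting values are illustrative assumptions and do not come from the text.

```python
import random

def ar_state_space(phi):
    """Return (F, G) of (9.3.4) and (9.3.3) for phi = [phi_1, ..., phi_p]."""
    p = len(phi)
    # The first p-1 rows shift the state; the last row holds phi_p, ..., phi_1.
    F = [[1.0 if j == i + 1 else 0.0 for j in range(p)] for i in range(p - 1)]
    F.append([phi[p - 1 - j] for j in range(p)])
    G = [0.0] * (p - 1) + [1.0]
    return F, G

def step(F, x, z_next):
    """One step of X_{t+1} = F X_t + (0, ..., 0, Z_{t+1})'."""
    p = len(x)
    x_next = [sum(F[i][j] * x[j] for j in range(p)) for i in range(p)]
    x_next[-1] += z_next
    return x_next

phi = [0.5, -0.3]          # hypothetical causal AR(2) coefficients
F, G = ar_state_space(phi)
rng = random.Random(0)
y = [0.2, -0.1]            # arbitrary starting values Y_0, Y_1
x = list(y)                # X_1 = (Y_0, Y_1)'
max_diff = 0.0
for _ in range(50):
    z = rng.gauss(0.0, 1.0)
    y.append(phi[0] * y[-1] + phi[1] * y[-2] + z)   # direct recursion (9.3.1)
    x = step(F, x, z)                                # state equation (9.3.4)
    y_from_state = sum(g * xi for g, xi in zip(G, x))  # observation equation (9.3.3)
    max_diff = max(max_diff, abs(y_from_state - y[-1]))
```

The two recursions perform the same arithmetic in a different arrangement, so `max_diff` should be zero up to rounding.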


Remark 2. If equations (9.3.3) and (9.3.4) are postulated to hold only for $t = 1, 2, \ldots$, and if $\mathbf{X}_1$ is a random vector such that $\{\mathbf{X}_1, Z_1, Z_2, \ldots\}$ is an uncorrelated sequence, then we have a state-space representation for $\{Y_t\}$ of the type defined earlier by (9.1.1) and (9.1.2). The resulting process $\{Y_t\}$ is well-defined, regardless of whether or not the state equation is stable, but it will not in general be stationary. It will be stationary if the state equation is stable and if $\mathbf{X}_1$ is defined by (9.3.2) with $Y_t = \sum_{j=0}^{\infty} \psi_j Z_{t-j}$, $t = 1, 0, \ldots, 2 - p$, and $\psi(z) = 1/\phi(z)$, $|z| \le 1$. □


Example 9.3.2 State-Space Form of a Causal ARMA(p, q) Process


State-space representations are not unique. Here we shall give one of the (infinitely many) possible representations of a causal ARMA(p, q) process that can easily be derived from Example 9.3.1. Consider the ARMA(p, q) process defined by

$$\phi(B)Y_t = \theta(B)Z_t, \quad t = 0, \pm 1, \ldots, \tag{9.3.5}$$

where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$ and $\phi(z) \ne 0$ for $|z| \le 1$. Let

$$r = \max(p, q + 1), \quad \phi_j = 0 \text{ for } j > p, \quad \theta_j = 0 \text{ for } j > q, \quad \text{and } \theta_0 = 1.$$

If $\{U_t\}$ is the causal AR(p) process satisfying

$$\phi(B)U_t = Z_t, \tag{9.3.6}$$

then $Y_t = \theta(B)U_t$, since

$$\phi(B)Y_t = \phi(B)\theta(B)U_t = \theta(B)\phi(B)U_t = \theta(B)Z_t.$$

Consequently,

$$Y_t = [\theta_{r-1}\ \ \theta_{r-2}\ \cdots\ \theta_0]\,\mathbf{X}_t, \tag{9.3.7}$$

where

$$\mathbf{X}_t = \begin{bmatrix} U_{t-r+1} \\ U_{t-r+2} \\ \vdots \\ U_t \end{bmatrix}. \tag{9.3.8}$$

But from Example 9.3.1 we can write

$$\mathbf{X}_{t+1} = \begin{bmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \\ \phi_r & \phi_{r-1} & \phi_{r-2} & \cdots & \phi_1 \end{bmatrix}\mathbf{X}_t + \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \\ 1 \end{bmatrix} Z_{t+1}, \quad t = 0, \pm 1, \ldots. \tag{9.3.9}$$

Equations (9.3.7) and (9.3.9) are the required observation and state equations. As in Example 9.3.1, the observation and state noise vectors are again $\mathbf{W}_t = \mathbf{0}$ and $\mathbf{V}_t = (0, 0, \ldots, Z_{t+1})'$, $t = 0, \pm 1, \ldots$.


Example 9.3.3 State-Space Representation of an ARIMA(p, d, q) Process


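If $\{Y_t\}$ is an ARIMA(p, d, q) process with $\{\nabla^d Y_t\}$ satisfying (9.3.5), then by the preceding example $\{\nabla^d Y_t\}$ has the representation

$$\nabla^d Y_t = G\mathbf{X}_t, \quad t = 0, \pm 1, \ldots, \tag{9.3.10}$$

where $\{\mathbf{X}_t\}$ is the unique stationary solution of the state equation

$$\mathbf{X}_{t+1} = F\mathbf{X}_t + \mathbf{V}_t,$$

$F$ and $G$ are the coefficients of $\mathbf{X}_t$ in (9.3.9) and (9.3.7), respectively, and $\mathbf{V}_t = (0, 0, \ldots, Z_{t+1})'$. Let $A$ and $B$ be the $d \times 1$ and $d \times d$ matrices defined by $A = B = 1$ if $d = 1$ and

$$A = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \\ 1 \end{bmatrix}, \quad B = \begin{bmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \\ (-1)^{d+1}\binom{d}{d} & (-1)^{d}\binom{d}{d-1} & (-1)^{d-1}\binom{d}{d-2} & \cdots & d \end{bmatrix}$$

if $d > 1$. Then since

$$Y_t = \nabla^d Y_t - \sum_{j=1}^{d} \binom{d}{j}(-1)^j Y_{t-j}, \tag{9.3.11}$$

the vector

$$\mathbf{Y}_t := (Y_{t-d+1}, \ldots, Y_t)'$$

satisfies the equation

$$\mathbf{Y}_t = A\nabla^d Y_t + B\mathbf{Y}_{t-1} = AG\mathbf{X}_t + B\mathbf{Y}_{t-1}.$$

Defining a new state vector $\mathbf{T}_t$ by stacking $\mathbf{X}_t$ and $\mathbf{Y}_{t-1}$, we therefore obtain the state equation

$$\mathbf{T}_{t+1} := \begin{bmatrix} \mathbf{X}_{t+1} \\ \mathbf{Y}_t \end{bmatrix} = \begin{bmatrix} F & 0 \\ AG & B \end{bmatrix}\mathbf{T}_t + \begin{bmatrix} \mathbf{V}_t \\ \mathbf{0} \end{bmatrix}, \quad t = 1, 2, \ldots, \tag{9.3.12}$$

and the observation equation, from (9.3.10) and (9.3.11),

$$Y_t = \left[\,G\ \ \ (-1)^{d+1}\binom{d}{d}\ \ \ (-1)^{d}\binom{d}{d-1}\ \ \ (-1)^{d-1}\binom{d}{d-2}\ \cdots\ d\,\right]\begin{bmatrix} \mathbf{X}_t \\ \mathbf{Y}_{t-1} \end{bmatrix}, \quad t = 1, 2, \ldots, \tag{9.3.13}$$

with initial condition

$$\mathbf{T}_1 = \begin{bmatrix} \mathbf{X}_1 \\ \mathbf{Y}_0 \end{bmatrix} = \begin{bmatrix} \sum_{j=0}^{\infty} F^j \mathbf{V}_{-j} \\ \mathbf{Y}_0 \end{bmatrix}, \tag{9.3.14}$$

and the assumption

$$E(\mathbf{Y}_0 Z_t') = 0, \quad t = 0, \pm 1, \ldots, \tag{9.3.15}$$

where $\mathbf{Y}_0 = (Y_{1-d}, Y_{2-d}, \ldots, Y_0)'$. The conditions (9.3.15), which are satisfied in particular if $\mathbf{Y}_0$ is considered to be nonrandom and equal to the vector of observed values $(y_{1-d}, y_{2-d}, \ldots, y_0)'$, are imposed to ensure that the assumptions of a state-space model given in Section 9.1 are satisfied. They also imply that $E(\mathbf{X}_1\mathbf{Y}_0') = 0$ and $E(\mathbf{Y}_0 \nabla^d Y_t') = 0$, $t \ge 1$, as required earlier in Section 6.4 for prediction of ARIMA processes.

State-space models for more general ARIMA processes (e.g., $\{Y_t\}$ such that $\{\nabla\nabla_{12} Y_t\}$ is an ARMA(p, q) process) can be constructed in the same way. See Problem 9.4.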

□ 
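
The constructions of Examples 9.3.2 and 9.3.3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `arma_state_space` and `arima_state_space` are hypothetical, matrices are stored as nested lists, and the coefficient values in the usage line are arbitrary.

```python
from math import comb

def arma_state_space(phi, theta):
    """F of (9.3.9) and G of (9.3.7); phi = [phi_1..phi_p], theta = [theta_1..theta_q]."""
    p, q = len(phi), len(theta)
    r = max(p, q + 1)
    phi_pad = list(phi) + [0.0] * (r - p)                    # phi_j = 0 for j > p
    theta_pad = [1.0] + list(theta) + [0.0] * (r - 1 - q)    # theta_0 = 1, theta_j = 0 for j > q
    F = [[1.0 if j == i + 1 else 0.0 for j in range(r)] for i in range(r - 1)]
    F.append([phi_pad[r - 1 - j] for j in range(r)])         # phi_r, ..., phi_1
    G = [theta_pad[r - 1 - j] for j in range(r)]             # theta_{r-1}, ..., theta_0
    return F, G

def arima_state_space(phi, theta, d):
    """Coefficient matrix of (9.3.12) and observation row of (9.3.13), for d >= 1."""
    F, G = arma_state_space(phi, theta)
    r = len(G)
    # Last row of B: the coefficient of Y_{t-j} is (-1)^(j+1) C(d, j), j = d, ..., 1.
    last = [float((-1) ** (j + 1) * comb(d, j)) for j in range(d, 0, -1)]
    B = [[1.0 if j == i + 1 else 0.0 for j in range(d)] for i in range(d - 1)] + [last]
    T = [row + [0.0] * d for row in F]                       # block row [F 0]
    for i in range(d):                                       # block row [AG B], A = (0, ..., 0, 1)'
        ag_row = list(G) if i == d - 1 else [0.0] * r
        T.append(ag_row + B[i])
    obs = list(G) + last
    return T, obs

# ARIMA(1, 1, 1) with illustrative values phi = 0.6, theta = 0.4
T, obs = arima_state_space([0.6], [0.4], 1)
```

For these inputs `T` has rows `[0, 1, 0]`, `[0, 0.6, 0]`, `[0.4, 1, 1]` and `obs` is `[0.4, 1, 1]`, which is the pattern produced by (9.3.12) and (9.3.13) when $p = d = q = 1$.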

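For the ARIMA(1, 1, 1) process defined by

$$(1 - \phi B)(1 - B)Y_t = (1 + \theta B)Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2),$$

the vectors $\mathbf{X}_t$ and $\mathbf{Y}_{t-1}$ reduce to $\mathbf{X}_t = (X_{t-1}, X_t)'$ and $\mathbf{Y}_{t-1} = Y_{t-1}$. From (9.3.12) and (9.3.13) the state-space representation is therefore (Problem 9.8)

$$Y_t = [\theta\ \ 1\ \ 1]\begin{bmatrix} X_{t-1} \\ X_t \\ Y_{t-1} \end{bmatrix}, \tag{9.3.16}$$

where

$$\begin{bmatrix} X_t \\ X_{t+1} \\ Y_t \end{bmatrix} = \begin{bmatrix} 0 & 1 & 0 \\ 0 & \phi & 0 \\ \theta & 1 & 1 \end{bmatrix}\begin{bmatrix} X_{t-1} \\ X_t \\ Y_{t-1} \end{bmatrix} + \begin{bmatrix} 0 \\ Z_{t+1} \\ 0 \end{bmatrix}, \tag{9.3.17}$$

and

$$\begin{bmatrix} X_0 \\ X_1 \\ Y_0 \end{bmatrix} = \begin{bmatrix} \sum_{j=0}^{\infty} \phi^j Z_{-j} \\ \sum_{j=0}^{\infty} \phi^j Z_{1-j} \\ Y_0 \end{bmatrix}. \tag{9.3.18}$$


9.4 The Kalman Recursions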

In this section we shall consider three fundamental problems associated with the state-space model defined by (9.1.1) and (9.1.2) in Section 9.1. These are all concerned with finding best (in the sense of minimum mean square error) linear estimates of the state-vector $\mathbf{X}_t$ in terms of the observations $\mathbf{Y}_1, \mathbf{Y}_2, \ldots,$ and a random vector $\mathbf{Y}_0$ that is orthogonal to $\mathbf{V}_t$ and $\mathbf{W}_t$ for all $t \ge 1$. In many cases $\mathbf{Y}_0$ will be the constant vector $(1, 1, \ldots, 1)'$. Estimation of $\mathbf{X}_t$ in terms of:

a. $\mathbf{Y}_0, \ldots, \mathbf{Y}_{t-1}$ defines the prediction problem,

b. $\mathbf{Y}_0, \ldots, \mathbf{Y}_t$ defines the filtering problem,

c. $\mathbf{Y}_0, \ldots, \mathbf{Y}_n$ ($n > t$) defines the smoothing problem.

Each of these problems can be solved recursively using an appropriate set of Kalman recursions, which will be established in this section.

In the following definition of best linear predictor (and throughout this chapter) it should be noted that we do not automatically include the constant 1 among the predictor variables as we did in Sections 2.5 and 8.5. (It can, however, be included by choosing $\mathbf{Y}_0 = (1, 1, \ldots, 1)'$.)


Definition  9.4.1 


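For the random vector $\mathbf{X} = (X_1, \ldots, X_v)'$,

$$P_t(\mathbf{X}) := (P_t(X_1), \ldots, P_t(X_v))',$$

where $P_t(X_i) := P(X_i \,|\, \mathbf{Y}_0, \mathbf{Y}_1, \ldots, \mathbf{Y}_t)$ is the best linear predictor of $X_i$ in terms of all components of $\mathbf{Y}_0, \mathbf{Y}_1, \ldots, \mathbf{Y}_t$.


Remark 1. By the definition of the best predictor of each component $X_i$ of $\mathbf{X}$, $P_t(\mathbf{X})$ is the unique random vector of the form

$$P_t(\mathbf{X}) = A_0\mathbf{Y}_0 + \cdots + A_t\mathbf{Y}_t$$

with $v \times w$ matrices $A_0, \ldots, A_t$ such that

$$[\mathbf{X} - P_t(\mathbf{X})] \perp \mathbf{Y}_s, \quad s = 0, \ldots, t$$

[cf. (8.5.2) and (8.5.3)]. Recall that two random vectors $\mathbf{X}$ and $\mathbf{Y}$ are orthogonal (written $\mathbf{X} \perp \mathbf{Y}$) if $E(\mathbf{X}\mathbf{Y}')$ is a matrix of zeros. □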

Remark 2. If all the components of $\mathbf{X}, \mathbf{Y}_1, \ldots, \mathbf{Y}_t$ are jointly normally distributed and $\mathbf{Y}_0 = (1, \ldots, 1)'$, then

$$P_t(\mathbf{X}) = E(\mathbf{X} \,|\, \mathbf{Y}_1, \ldots, \mathbf{Y}_t), \quad t \ge 1.$$ □


Remark 3. $P_t$ is linear in the sense that if $A$ is any $k \times v$ matrix and $\mathbf{X}$, $\mathbf{V}$ are two $v$-variate random vectors with finite second moments, then (Problem 9.10)

$$P_t(A\mathbf{X}) = AP_t(\mathbf{X})$$

and

$$P_t(\mathbf{X} + \mathbf{V}) = P_t(\mathbf{X}) + P_t(\mathbf{V}).$$ □

Remark 4. If $\mathbf{X}$ and $\mathbf{Y}$ are random vectors with $v$ and $w$ components, respectively, each with finite second moments, then

$$P(\mathbf{X} \,|\, \mathbf{Y}) = M\mathbf{Y},$$

where $M$ is a $v \times w$ matrix, $M = E(\mathbf{X}\mathbf{Y}')[E(\mathbf{Y}\mathbf{Y}')]^{-1}$ with $[E(\mathbf{Y}\mathbf{Y}')]^{-1}$ any generalized inverse of $E(\mathbf{Y}\mathbf{Y}')$. (A generalized inverse of a matrix $S$ is a matrix $S^{-1}$ such that $SS^{-1}S = S$. Every matrix has at least one. See Problem 9.11.)

In the notation just developed, the prediction, filtering, and smoothing problems (a), (b), and (c) formulated above reduce to the determination of $P_{t-1}(\mathbf{X}_t)$, $P_t(\mathbf{X}_t)$, and $P_n(\mathbf{X}_t)$ ($n > t$), respectively. We deal first with the prediction problem. □


Kalman  Prediction: 

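For the state-space model (9.1.1)-(9.1.2), the one-step predictors $\hat{\mathbf{X}}_t := P_{t-1}(\mathbf{X}_t)$ and their error covariance matrices $\Omega_t = E[(\mathbf{X}_t - \hat{\mathbf{X}}_t)(\mathbf{X}_t - \hat{\mathbf{X}}_t)']$ are uniquely determined by the initial conditions

$$\hat{\mathbf{X}}_1 = P(\mathbf{X}_1 \,|\, \mathbf{Y}_0), \quad \Omega_1 = E[(\mathbf{X}_1 - \hat{\mathbf{X}}_1)(\mathbf{X}_1 - \hat{\mathbf{X}}_1)']$$

and the recursions, for $t = 1, \ldots,$

$$\hat{\mathbf{X}}_{t+1} = F_t\hat{\mathbf{X}}_t + \Theta_t\Delta_t^{-1}(\mathbf{Y}_t - G_t\hat{\mathbf{X}}_t), \tag{9.4.1}$$

$$\Omega_{t+1} = F_t\Omega_tF_t' + Q_t - \Theta_t\Delta_t^{-1}\Theta_t', \tag{9.4.2}$$

where

$$\Delta_t = G_t\Omega_tG_t' + R_t,$$

$$\Theta_t = F_t\Omega_tG_t',$$

and $\Delta_t^{-1}$ is any generalized inverse of $\Delta_t$.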
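
A minimal sketch of one pass through (9.4.1) and (9.4.2) is given below. It assumes a scalar observation ($w = 1$) with $\Delta_t > 0$, so that an ordinary reciprocal can stand in for the generalized inverse; the function name `kalman_predict_step` and the nested-list representation of matrices are illustrative choices, not part of the text.

```python
def kalman_predict_step(x_hat, omega, y, F, G, Q, R):
    """One step of (9.4.1)-(9.4.2) for a v-dimensional state and scalar observation.

    x_hat: list of length v, omega/F/Q: v x v nested lists,
    G: list of length v (the 1 x v observation row), R: scalar noise variance.
    Returns (x_hat_next, omega_next, innovation, delta).
    """
    v = len(x_hat)
    # Omega_t G_t'  (a v-vector)
    og = [sum(omega[i][k] * G[k] for k in range(v)) for i in range(v)]
    # Delta_t = G_t Omega_t G_t' + R_t  (a scalar here; assumed positive)
    delta = sum(G[i] * og[i] for i in range(v)) + R
    # Theta_t = F_t Omega_t G_t'  (a v-vector)
    theta = [sum(F[i][k] * og[k] for k in range(v)) for i in range(v)]
    # Innovation Y_t - G_t X_t hat
    innov = y - sum(G[i] * x_hat[i] for i in range(v))
    fx = [sum(F[i][k] * x_hat[k] for k in range(v)) for i in range(v)]
    x_next = [fx[i] + theta[i] * innov / delta for i in range(v)]       # (9.4.1)
    fo = [[sum(F[i][k] * omega[k][j] for k in range(v)) for j in range(v)]
          for i in range(v)]
    fof = [[sum(fo[i][k] * F[j][k] for k in range(v)) for j in range(v)]
           for i in range(v)]
    omega_next = [[fof[i][j] + Q[i][j] - theta[i] * theta[j] / delta    # (9.4.2)
                   for j in range(v)] for i in range(v)]
    return x_next, omega_next, innov, delta

# Usage with a one-dimensional state: F = G = 1, Q = 4, R = 8, Omega_t = 8.
x_next, omega_next, innov, delta = kalman_predict_step(
    [0.0], [[8.0]], 2.0, [[1.0]], [1.0], [[4.0]], 8.0)
```

With these illustrative inputs, $\Delta_t = 16$ and $\Theta_t = 8$, so the predictor moves halfway toward the observation and $\Omega_{t+1} = 8 + 4 - 64/16 = 8$.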


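Proof. We shall make use of the innovations $\mathbf{I}_t$ defined by $\mathbf{I}_0 = \mathbf{Y}_0$ and

$$\mathbf{I}_t = \mathbf{Y}_t - P_{t-1}\mathbf{Y}_t = \mathbf{Y}_t - G_t\hat{\mathbf{X}}_t = G_t(\mathbf{X}_t - \hat{\mathbf{X}}_t) + \mathbf{W}_t, \quad t = 1, 2, \ldots.$$

The sequence $\{\mathbf{I}_t\}$ is orthogonal by Remark 1. Using Remarks 3 and 4 and the relation

$$P_t(\cdot) = P_{t-1}(\cdot) + P(\cdot \,|\, \mathbf{I}_t) \tag{9.4.3}$$

(see Problem 9.12), we find that

$$\hat{\mathbf{X}}_{t+1} = P_{t-1}(\mathbf{X}_{t+1}) + P(\mathbf{X}_{t+1} \,|\, \mathbf{I}_t) = P_{t-1}(F_t\mathbf{X}_t + \mathbf{V}_t) + \Theta_t\Delta_t^{-1}\mathbf{I}_t = F_t\hat{\mathbf{X}}_t + \Theta_t\Delta_t^{-1}\mathbf{I}_t, \tag{9.4.4}$$

where

$$\Delta_t = E(\mathbf{I}_t\mathbf{I}_t') = G_t\Omega_tG_t' + R_t,$$

$$\Theta_t = E(\mathbf{X}_{t+1}\mathbf{I}_t') = E\left[(F_t\mathbf{X}_t + \mathbf{V}_t)\left([\mathbf{X}_t - \hat{\mathbf{X}}_t]'G_t' + \mathbf{W}_t'\right)\right] = F_t\Omega_tG_t'.$$

To verify (9.4.2), we observe from the definition of $\Omega_{t+1}$ that

$$\Omega_{t+1} = E(\mathbf{X}_{t+1}\mathbf{X}_{t+1}') - E(\hat{\mathbf{X}}_{t+1}\hat{\mathbf{X}}_{t+1}').$$

With (9.1.2) and (9.4.4) this gives

$$\Omega_{t+1} = F_tE(\mathbf{X}_t\mathbf{X}_t')F_t' + Q_t - F_tE(\hat{\mathbf{X}}_t\hat{\mathbf{X}}_t')F_t' - \Theta_t\Delta_t^{-1}\Theta_t' = F_t\Omega_tF_t' + Q_t - \Theta_t\Delta_t^{-1}\Theta_t'.$$ ■


9.4.1 h-Step Prediction of $\{\mathbf{Y}_t\}$ Using the Kalman Recursions


The Kalman prediction equations lead to a very simple algorithm for recursive calculation of the best linear mean square predictors $P_t\mathbf{Y}_{t+h}$, $h = 1, 2, \ldots$. From (9.4.4), (9.1.1), (9.1.2), and Remark 3 in Section 9.1, we find that

$$P_t\mathbf{X}_{t+1} = F_tP_{t-1}\mathbf{X}_t + \Theta_t\Delta_t^{-1}(\mathbf{Y}_t - P_{t-1}\mathbf{Y}_t), \tag{9.4.5}$$

$$P_t\mathbf{X}_{t+h} = F_{t+h-1}P_t\mathbf{X}_{t+h-1} = \cdots = (F_{t+h-1}F_{t+h-2}\cdots F_{t+1})P_t\mathbf{X}_{t+1}, \quad h = 2, 3, \ldots, \tag{9.4.6}$$

and

$$P_t\mathbf{Y}_{t+h} = G_{t+h}P_t\mathbf{X}_{t+h}, \quad h = 1, 2, \ldots. \tag{9.4.7}$$

From the relation

$$\mathbf{X}_{t+h} - P_t\mathbf{X}_{t+h} = F_{t+h-1}(\mathbf{X}_{t+h-1} - P_t\mathbf{X}_{t+h-1}) + \mathbf{V}_{t+h-1}, \quad h = 2, 3, \ldots,$$

we find that $\Omega_t^{(h)} := E[(\mathbf{X}_{t+h} - P_t\mathbf{X}_{t+h})(\mathbf{X}_{t+h} - P_t\mathbf{X}_{t+h})']$ satisfies the recursions

$$\Omega_t^{(h)} = F_{t+h-1}\Omega_t^{(h-1)}F_{t+h-1}' + Q_{t+h-1}, \quad h = 2, 3, \ldots, \tag{9.4.8}$$

with $\Omega_t^{(1)} = \Omega_{t+1}$. Then from (9.1.1) and (9.4.7), $\Delta_t^{(h)} := E[(\mathbf{Y}_{t+h} - P_t\mathbf{Y}_{t+h})(\mathbf{Y}_{t+h} - P_t\mathbf{Y}_{t+h})']$ is given by

$$\Delta_t^{(h)} = G_{t+h}\Omega_t^{(h)}G_{t+h}' + R_{t+h}, \quad h = 1, 2, \ldots. \tag{9.4.9}$$


Example 9.4.1.

Consider the random walk plus noise model of Example 9.2.1 defined by

$$Y_t = X_t + W_t, \quad \{W_t\} \sim \mathrm{WN}(0, \sigma_w^2),$$

where the local level $X_t$ follows the random walk

$$X_{t+1} = X_t + V_t, \quad \{V_t\} \sim \mathrm{WN}(0, \sigma_v^2).$$

Applying the Kalman prediction equations with $Y_0 := 1$, $R = \sigma_w^2$, and $Q = \sigma_v^2$, we obtain

$$\hat{Y}_{t+1} = P_tY_{t+1} = \hat{X}_t + \frac{\Theta_t}{\Delta_t}(Y_t - \hat{Y}_t) = (1 - a_t)\hat{Y}_t + a_tY_t,$$

where

$$a_t = \frac{\Theta_t}{\Delta_t} = \frac{\Omega_t}{\Omega_t + \sigma_w^2}.$$

For a state-space model (like this one) with time-independent parameters, the solution of the Kalman recursions (9.4.2) is called a steady-state solution if $\Omega_t$ is independent of $t$. If $\Omega_t = \Omega$ for all $t$, then from (9.4.2)

$$\Omega_{t+1} = \Omega = \Omega + \sigma_v^2 - \frac{\Omega^2}{\Omega + \sigma_w^2} = \frac{\Omega\sigma_w^2}{\Omega + \sigma_w^2} + \sigma_v^2.$$

Solving this quadratic equation for $\Omega$ and noting that $\Omega \ge 0$, we find that

$$\Omega = \frac{1}{2}\left(\sigma_v^2 + \sqrt{\sigma_v^4 + 4\sigma_v^2\sigma_w^2}\right).$$

Since $\Omega_{t+1} - \Omega_t$ is a continuous function of $\Omega_t$ on $\Omega_t \ge 0$, positive at $\Omega_t = 0$, negative for large $\Omega_t$, and zero only at $\Omega_t = \Omega$, it is clear that $\Omega_{t+1} - \Omega_t$ is negative for $\Omega_t > \Omega$ and positive for $\Omega_t < \Omega$. A similar argument shows (Problem 9.14) that $(\Omega_{t+1} - \Omega)(\Omega_t - \Omega) \ge 0$ for all $\Omega_t \ge 0$. These observations imply that $\Omega_{t+1}$ always falls between $\Omega$ and $\Omega_t$. Consequently, regardless of the value of $\Omega_1$, $\Omega_t$ converges to $\Omega$, the unique solution of $\Omega_{t+1} = \Omega_t$. For any initial predictors $\hat{Y}_1 = \hat{X}_1$ and any initial mean squared error $\Omega_1 = E(X_1 - \hat{X}_1)^2$, the coefficients $a_t := \Omega_t/(\Omega_t + \sigma_w^2)$ converge to

$$a = \frac{\Omega}{\Omega + \sigma_w^2},$$

and the mean squared errors of the predictors defined by

$$\hat{Y}_{t+1} = (1 - a_t)\hat{Y}_t + a_tY_t$$

converge to $\Omega + \sigma_w^2$.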

If, as is often the case, we do not know $\Omega_1$, then we cannot determine the sequence $\{a_t\}$. It is natural, therefore, to consider the behavior of the predictors defined by

$$\hat{Y}_{t+1} = (1 - a)\hat{Y}_t + aY_t$$

with $a$ as above and arbitrary $\hat{Y}_1$. It can be shown (Problem 9.16) that this sequence of predictors is also asymptotically optimal in the sense that the mean squared error converges to $\Omega + \sigma_w^2$ as $t \to \infty$.

As shown in Example 9.2.1, the differenced process $D_t = Y_t - Y_{t-1}$ is the MA(1) process

$$D_t = Z_t + \theta Z_{t-1}, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2),$$

where $\theta/(1 + \theta^2) = -\sigma_w^2/(2\sigma_w^2 + \sigma_v^2)$. Solving this equation for $\theta$ (Problem 9.15), we find that

$$\theta = -\frac{1}{2\sigma_w^2}\left(2\sigma_w^2 + \sigma_v^2 - \sqrt{\sigma_v^4 + 4\sigma_v^2\sigma_w^2}\right)$$

and that $\theta = a - 1$.

It is instructive to derive the exponential smoothing formula for $\hat{Y}_t$ directly from the ARIMA(0,1,1) structure of $\{Y_t\}$. For $t \ge 2$, we have from Section 6.5 that

$$\hat{Y}_{t+1} = Y_t + \theta_{t1}(Y_t - \hat{Y}_t) = -\theta_{t1}\hat{Y}_t + (1 + \theta_{t1})Y_t,$$

where $\theta_{t1}$ is found by application of the innovations algorithm to an MA(1) process with coefficient $\theta$. It follows that $1 - a_t = -\theta_{t1}$, and since $\theta_{t1} \to \theta$ (see Remark 1 of Section 3.3) and $a_t$ converges to the steady-state solution $a$, we conclude that

$$1 - a = \lim_{t \to \infty}(1 - a_t) = -\lim_{t \to \infty}\theta_{t1} = -\theta.$$

□ 
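
The steady-state quantities of Example 9.4.1 are easy to check numerically. The sketch below evaluates the closed-form expressions for $\Omega$, $a$, and $\theta$ and iterates the scalar form of (9.4.2) from two different starting values. The function names are hypothetical, and the variances $\sigma_v^2 = 4$, $\sigma_w^2 = 8$ are chosen only for illustration.

```python
from math import sqrt

def steady_state(sigma_v2, sigma_w2):
    """Closed-form steady-state Omega, smoothing constant a, and MA(1) coefficient theta."""
    root = sqrt(sigma_v2 ** 2 + 4.0 * sigma_v2 * sigma_w2)
    omega = 0.5 * (sigma_v2 + root)
    a = omega / (omega + sigma_w2)
    theta = -(2.0 * sigma_w2 + sigma_v2 - root) / (2.0 * sigma_w2)
    return omega, a, theta

def iterate_omega(omega_1, sigma_v2, sigma_w2, n_steps):
    """Iterate Omega_{t+1} = Omega_t + sigma_v^2 - Omega_t^2 / (Omega_t + sigma_w^2)."""
    omega = omega_1
    for _ in range(n_steps):
        omega = omega + sigma_v2 - omega ** 2 / (omega + sigma_w2)
    return omega

omega, a, theta = steady_state(4.0, 8.0)
omega_from_zero = iterate_omega(0.0, 4.0, 8.0, 100)
omega_from_large = iterate_omega(1000.0, 4.0, 8.0, 100)
```

For these values the closed forms give $\Omega = 8$, $a = 1/2$, and $\theta = -1/2 = a - 1$, and both iterations settle at the same $\Omega$, in line with the convergence argument of the example.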


Example  9.4.2.  The  lognormal  stochastic  volatility  model 


We can rewrite the defining equations (7.4.2) and (7.4.3) of the lognormal SV process $\{Z_t\}$ in the following state-space form

$$X_t = \gamma_1X_{t-1} + \eta_t, \tag{9.4.10}$$

and

$$Y_t = X_t + \varepsilon_t, \tag{9.4.11}$$

where the (one-dimensional) state and observation vectors are

$$X_t = \ell_t - \frac{\gamma_0}{1 - \gamma_1} \tag{9.4.12}$$

and

$$Y_t = \ln Z_t^2 + 1.27 - \frac{\gamma_0}{1 - \gamma_1} \tag{9.4.13}$$

respectively. The independent white-noise sequences $\{\eta_t\}$ and $\{\varepsilon_t\}$ have zero means and variances $\sigma^2$ and 4.93 respectively.

Taking

$$\hat{X}_0 = EX_0 = 0 \tag{9.4.14}$$

and

$$\Omega_0 = \mathrm{Var}(X_0) = \sigma^2/(1 - \gamma_1^2), \tag{9.4.15}$$

we can directly apply the Kalman prediction recursions (9.4.1), (9.4.2), (9.4.6) and (9.4.8) to compute recursively the best linear predictor of $X_{t+h}$ in terms of $\{Y_s, s \le t\}$, or equivalently of the log volatility $\ell_{t+h}$ in terms of the observations $\{\ln Z_s^2, s \le t\}$.


□ 


Kalman  Filtering: 

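The filtered estimates $\mathbf{X}_{t|t} = P_t(\mathbf{X}_t)$ and their error covariance matrices $\Omega_{t|t} = E[(\mathbf{X}_t - \mathbf{X}_{t|t})(\mathbf{X}_t - \mathbf{X}_{t|t})']$ are determined by the relations

$$P_t\mathbf{X}_t = P_{t-1}\mathbf{X}_t + \Omega_tG_t'\Delta_t^{-1}(\mathbf{Y}_t - G_t\hat{\mathbf{X}}_t) \tag{9.4.16}$$

and

$$\Omega_{t|t} = \Omega_t - \Omega_tG_t'\Delta_t^{-1}G_t\Omega_t'. \tag{9.4.17}$$


Proof. From (9.4.3) it follows that

$$P_t\mathbf{X}_t = P_{t-1}\mathbf{X}_t + M\mathbf{I}_t,$$

where

$$M = E(\mathbf{X}_t\mathbf{I}_t')[E(\mathbf{I}_t\mathbf{I}_t')]^{-1} = E\left[\mathbf{X}_t\left(G_t(\mathbf{X}_t - \hat{\mathbf{X}}_t) + \mathbf{W}_t\right)'\right]\Delta_t^{-1} = \Omega_tG_t'\Delta_t^{-1}. \tag{9.4.18}$$

To establish (9.4.17) we write

$$\mathbf{X}_t - P_{t-1}\mathbf{X}_t = \mathbf{X}_t - P_t\mathbf{X}_t + P_t\mathbf{X}_t - P_{t-1}\mathbf{X}_t = \mathbf{X}_t - P_t\mathbf{X}_t + M\mathbf{I}_t.$$

Using (9.4.18) and the orthogonality of $\mathbf{X}_t - P_t\mathbf{X}_t$ and $M\mathbf{I}_t$, we find from the last equation that

$$\Omega_t = \Omega_{t|t} + \Omega_tG_t'\Delta_t^{-1}G_t\Omega_t',$$

as required. ■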
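
As a brief worked illustration (not part of the original text), consider the random walk plus noise model $Y_t = X_t + W_t$, $X_{t+1} = X_t + V_t$ with $\mathrm{Var}(W_t) = \sigma_w^2$ and $\mathrm{Var}(V_t) = \sigma_v^2$, so that $F = G = 1$, $Q = \sigma_v^2$, $R = \sigma_w^2$, and $\Delta_t = \Omega_t + \sigma_w^2$. Writing $a_t := \Omega_t/(\Omega_t + \sigma_w^2)$, the filtering relations (9.4.16) and (9.4.17) reduce to

$$X_{t|t} = \hat{X}_t + a_t(Y_t - \hat{X}_t), \qquad \Omega_{t|t} = \Omega_t - \frac{\Omega_t^2}{\Omega_t + \sigma_w^2} = \frac{\Omega_t\sigma_w^2}{\Omega_t + \sigma_w^2} = a_t\sigma_w^2.$$

Comparing with (9.4.2) for the same model gives $\Omega_{t+1} = \Omega_{t|t} + \sigma_v^2$: the one-step prediction error variance is the filtering error variance plus the variance of the next state increment. For example, with $\sigma_v^2 = 4$, $\sigma_w^2 = 8$, and $\Omega_t = 8$, we get $\Omega_{t|t} = 4$ and $\Omega_{t+1} = 8$.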


Kalman  Fixed-Point  Smoothing: 

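The smoothed estimates $\mathbf{X}_{t|n} = P_n\mathbf{X}_t$ and the error covariance matrices $\Omega_{t|n} = E[(\mathbf{X}_t - \mathbf{X}_{t|n})(\mathbf{X}_t - \mathbf{X}_{t|n})']$ are determined for fixed $t$ by the following recursions, which can be solved successively for $n = t, t + 1, \ldots$:

$$P_n\mathbf{X}_t = P_{n-1}\mathbf{X}_t + \Omega_{t,n}G_n'\Delta_n^{-1}(\mathbf{Y}_n - G_n\hat{\mathbf{X}}_n), \tag{9.4.19}$$

$$\Omega_{t,n+1} = \Omega_{t,n}[F_n - \Theta_n\Delta_n^{-1}G_n]', \tag{9.4.20}$$

$$\Omega_{t|n} = \Omega_{t|n-1} - \Omega_{t,n}G_n'\Delta_n^{-1}G_n\Omega_{t,n}', \tag{9.4.21}$$

with initial conditions $P_{t-1}\mathbf{X}_t = \hat{\mathbf{X}}_t$ and $\Omega_{t,t} = \Omega_{t|t-1} = \Omega_t$ (found from Kalman prediction).


Proof. Using (9.4.3) we can write $P_n\mathbf{X}_t = P_{n-1}\mathbf{X}_t + C\mathbf{I}_n$, where $\mathbf{I}_n = G_n(\mathbf{X}_n - \hat{\mathbf{X}}_n) + \mathbf{W}_n$. By Remark 4 above,

$$C = E\left[\mathbf{X}_t\left(G_n(\mathbf{X}_n - \hat{\mathbf{X}}_n) + \mathbf{W}_n\right)'\right][E(\mathbf{I}_n\mathbf{I}_n')]^{-1} = \Omega_{t,n}G_n'\Delta_n^{-1}, \tag{9.4.22}$$

where $\Omega_{t,n} := E[(\mathbf{X}_t - \hat{\mathbf{X}}_t)(\mathbf{X}_n - \hat{\mathbf{X}}_n)']$. It follows now from (9.1.2), (9.4.5), the orthogonality of $\mathbf{V}_n$ and $\mathbf{W}_n$ with $\mathbf{X}_t - \hat{\mathbf{X}}_t$, and the definition of $\Omega_{t,n}$ that

$$\Omega_{t,n+1} = E\left[(\mathbf{X}_t - \hat{\mathbf{X}}_t)(\mathbf{X}_n - \hat{\mathbf{X}}_n)'(F_n - \Theta_n\Delta_n^{-1}G_n)'\right] = \Omega_{t,n}[F_n - \Theta_n\Delta_n^{-1}G_n]',$$

thus establishing (9.4.20). To establish (9.4.21) we write

$$\mathbf{X}_t - P_n\mathbf{X}_t = \mathbf{X}_t - P_{n-1}\mathbf{X}_t - C\mathbf{I}_n.$$

Using (9.4.22) and the orthogonality of $\mathbf{X}_t - P_n\mathbf{X}_t$ and $\mathbf{I}_n$, the last equation then gives

$$\Omega_{t|n} = \Omega_{t|n-1} - \Omega_{t,n}G_n'\Delta_n^{-1}G_n\Omega_{t,n}', \quad n = t, t + 1, \ldots,$$

as required. ■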
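
To see how the smoothing recursions behave, the following sketch specializes (9.4.20) and (9.4.21) to a scalar model with $F = G = 1$ (the random walk plus noise model) and tracks how the error variance of the estimate of $X_t$ shrinks as later observations arrive. The function name and the numerical values are illustrative assumptions.

```python
def fixed_point_smoothing_variances(omega_t, sigma_v2, sigma_w2, n_extra):
    """Return [Omega_{t|t}, Omega_{t|t+1}, ..., Omega_{t|t+n_extra}] when F = G = 1."""
    omega_n = omega_t      # Omega_n: one-step prediction error variance at time n
    cross = omega_t        # Omega_{t,n}, initialised with Omega_{t,t} = Omega_t
    smoothed = omega_t     # Omega_{t|n-1}, initialised with Omega_{t|t-1} = Omega_t
    out = []
    for _ in range(n_extra + 1):
        delta = omega_n + sigma_w2                          # Delta_n = G Omega_n G' + R
        theta = omega_n                                     # Theta_n = F Omega_n G'
        smoothed = smoothed - cross ** 2 / delta            # (9.4.21)
        out.append(smoothed)
        cross = cross * (1.0 - theta / delta)               # (9.4.20)
        omega_n = omega_n + sigma_v2 - theta ** 2 / delta   # (9.4.2)
    return out

# Start from the steady-state value Omega_t = 8 for sigma_v^2 = 4, sigma_w^2 = 8.
variances = fixed_point_smoothing_variances(8.0, 4.0, 8.0, 2)
```

Each additional observation reduces the variance by $\Omega_{t,n}^2/\Delta_n$, and since $\Omega_{t,n}$ decays geometrically in this example, observations far beyond time $t$ contribute very little.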


9.5  Estimation  for  State-Space  Models 

Consider the state-space model defined by equations (9.1.1) and (9.1.2) and suppose that the model is completely parameterized by the components of the vector $\theta$. The maximum likelihood estimate of $\theta$ is found by maximizing the likelihood of the observations $\mathbf{Y}_1, \ldots, \mathbf{Y}_n$ with respect to the components of the vector $\theta$. If the conditional probability density of $\mathbf{Y}_t$ given $\mathbf{Y}_{t-1} = \mathbf{y}_{t-1}, \ldots, \mathbf{Y}_0 = \mathbf{y}_0$ is $f_t(\cdot \,|\, \mathbf{y}_{t-1}, \ldots, \mathbf{y}_0)$, then the likelihood of $\mathbf{Y}_t$, $t = 1, \ldots, n$ (conditional on $\mathbf{Y}_0$), can immediately be written as

$$L(\theta; \mathbf{Y}_1, \ldots, \mathbf{Y}_n) = \prod_{t=1}^{n} f_t(\mathbf{Y}_t \,|\, \mathbf{Y}_{t-1}, \ldots, \mathbf{Y}_0). \tag{9.5.1}$$

The calculation of the likelihood for any fixed numerical value of $\theta$ is extremely complicated in general, but is greatly simplified if $\mathbf{Y}_0$, $\mathbf{X}_1$ and $\mathbf{W}_t$, $\mathbf{V}_t$, $t = 1, 2, \ldots$, are assumed to be jointly Gaussian. The resulting likelihood is called the Gaussian likelihood and is widely used in time series analysis (cf. Section 5.2) whether the time series is truly Gaussian or not. As before, we shall continue to use the term likelihood to mean Gaussian likelihood.

If $\mathbf{Y}_0$, $\mathbf{X}_1$ and $\mathbf{W}_t$, $\mathbf{V}_t$, $t = 1, 2, \ldots$, are jointly Gaussian, then the conditional densities in (9.5.1) are given by

$$f_t(\mathbf{Y}_t \,|\, \mathbf{Y}_{t-1}, \ldots, \mathbf{Y}_0) = (2\pi)^{-w/2}(\det\Delta_t)^{-1/2}\exp\left[-\frac{1}{2}\mathbf{I}_t'\Delta_t^{-1}\mathbf{I}_t\right],$$

where $\mathbf{I}_t = \mathbf{Y}_t - P_{t-1}\mathbf{Y}_t = \mathbf{Y}_t - G_t\hat{\mathbf{X}}_t$, and $P_{t-1}\mathbf{Y}_t$ and $\Delta_t$, $t \ge 1$, are the one-step predictors and error covariance matrices found from the Kalman prediction recursions. The likelihood of the observations $\mathbf{Y}_1, \ldots, \mathbf{Y}_n$ (conditional on $\mathbf{Y}_0$) can therefore be expressed as

$$L(\theta; \mathbf{Y}_1, \ldots, \mathbf{Y}_n) = (2\pi)^{-nw/2}\left(\prod_{j=1}^{n}\det\Delta_j\right)^{-1/2}\exp\left[-\frac{1}{2}\sum_{j=1}^{n}\mathbf{I}_j'\Delta_j^{-1}\mathbf{I}_j\right]. \tag{9.5.2}$$

Given the observations $\mathbf{Y}_1, \ldots, \mathbf{Y}_n$, the distribution of $\mathbf{Y}_0$ (see Section 9.4), and a particular parameter value $\theta$, the numerical value of the likelihood $L$ can be computed from the previous equation with the aid of the Kalman recursions of Section 9.4. To find maximum likelihood estimates of the components of $\theta$, a nonlinear optimization algorithm must be used to search for the value of $\theta$ that maximizes the value of $L$.

Having estimated the parameter vector $\theta$, we can compute forecasts based on the fitted state-space model and estimated mean squared errors by direct application of equations (9.4.7) and (9.4.9).
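
The evaluation of (9.5.2) from the Kalman prediction recursions can be sketched as follows for the simplest case of a scalar state and scalar observation with time-invariant coefficients. The function name `neg2_log_likelihood`, its argument list, and the numerical inputs are assumptions made for illustration; the function returns $-2\ln L$ rather than $L$ to avoid numerical underflow for long series.

```python
from math import log, pi

def neg2_log_likelihood(y, F, G, Q, R, x1_hat, omega1):
    """-2 ln L of (9.5.2) for a model with scalar state and observation (v = w = 1)."""
    x_hat, omega = x1_hat, omega1
    total = 0.0
    for obs in y:
        delta = G * omega * G + R            # Delta_t (assumed positive)
        theta = F * omega * G                # Theta_t
        innov = obs - G * x_hat              # innovation I_t
        total += log(2.0 * pi) + log(delta) + innov ** 2 / delta
        x_hat = F * x_hat + theta * innov / delta        # (9.4.1)
        omega = F * omega * F + Q - theta ** 2 / delta   # (9.4.2)
    return total

# Random walk plus noise with X_1 = 0 known (Omega_1 = 0), Q = 4, R = 1.
value = neg2_log_likelihood([1.0, 2.0], 1.0, 1.0, 4.0, 1.0, 0.0, 0.0)
```

In this small case $Y_1 = W_1$ and $Y_2 = V_1 + W_2$ are uncorrelated with variances 1 and 5, so the result can be checked by hand as $2\ln(2\pi) + \ln 5 + 1 + 4/5$. A general-purpose optimizer would then be applied to this function of the parameters.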


9.5.1  Application  to  Structural  Models 

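The general structural model for a univariate time series $\{Y_t\}$ of which we gave examples in Section 9.2 has the form

$$Y_t = G\mathbf{X}_t + W_t, \quad \{W_t\} \sim \mathrm{WN}(0, \sigma_w^2), \tag{9.5.3}$$

$$\mathbf{X}_{t+1} = F\mathbf{X}_t + \mathbf{V}_t, \quad \{\mathbf{V}_t\} \sim \mathrm{WN}(0, Q), \tag{9.5.4}$$

for $t = 1, 2, \ldots$, where $F$ and $G$ are assumed known. We set $Y_0 = 1$ in order to include constant terms in our predictors and complete the specification of the model by prescribing the mean and covariance matrix of the initial state $\mathbf{X}_1$. A simple and convenient assumption is that $\mathbf{X}_1$ is equal to a deterministic but unknown parameter $\boldsymbol{\mu}$ and that $\hat{\mathbf{X}}_1 = \boldsymbol{\mu}$, so that $\Omega_1 = 0$. The parameters of the model are then $\boldsymbol{\mu}$, $Q$, and $\sigma_w^2$.

Direct maximization of the likelihood (9.5.2) is difficult if the dimension of the state vector is large. The maximization can, however, be simplified by the following stepwise procedure. For fixed $Q$ we find $\hat{\boldsymbol{\mu}}(Q)$ and $\hat{\sigma}_w^2(Q)$ that maximize the likelihood $L(\boldsymbol{\mu}, Q, \sigma_w^2)$. We then maximize the "reduced likelihood" $L(\hat{\boldsymbol{\mu}}(Q), Q, \hat{\sigma}_w^2(Q))$ with respect to $Q$.

To achieve this we define the mean-corrected state vectors, $\mathbf{X}_t^* = \mathbf{X}_t - F^{t-1}\boldsymbol{\mu}$, and apply the Kalman prediction recursions to $\{\mathbf{X}_t^*\}$ with initial condition $\hat{\mathbf{X}}_1^* = \mathbf{0}$. This gives, from (9.4.1),

$$\hat{\mathbf{X}}_{t+1}^* = F\hat{\mathbf{X}}_t^* + \Theta_t\Delta_t^{-1}(Y_t - G\hat{\mathbf{X}}_t^*), \quad t = 1, 2, \ldots, \tag{9.5.5}$$

with $\hat{\mathbf{X}}_1^* = \mathbf{0}$. Since $\hat{\mathbf{X}}_t$ also satisfies (9.5.5), but with initial condition $\hat{\mathbf{X}}_1 = \boldsymbol{\mu}$, it follows that

$$\hat{\mathbf{X}}_t = \hat{\mathbf{X}}_t^* + C_t\boldsymbol{\mu} \tag{9.5.6}$$

for some $v \times v$ matrices $C_t$. (Note that although $\hat{\mathbf{X}}_t = P(\mathbf{X}_t \,|\, Y_0, Y_1, \ldots, Y_t)$, the quantity $\hat{\mathbf{X}}_t^*$ is not the corresponding predictor of $\mathbf{X}_t^*$.) The matrices $C_t$ can be determined recursively from (9.5.5), (9.5.6), and (9.4.1). Substituting (9.5.6) into (9.5.5) and using (9.4.1), we have

$$\begin{aligned}
\hat{\mathbf{X}}_{t+1}^* &= F(\hat{\mathbf{X}}_t - C_t\boldsymbol{\mu}) + \Theta_t\Delta_t^{-1}\left(Y_t - G(\hat{\mathbf{X}}_t - C_t\boldsymbol{\mu})\right) \\
&= F\hat{\mathbf{X}}_t + \Theta_t\Delta_t^{-1}(Y_t - G\hat{\mathbf{X}}_t) - (F - \Theta_t\Delta_t^{-1}G)C_t\boldsymbol{\mu} \\
&= \hat{\mathbf{X}}_{t+1} - (F - \Theta_t\Delta_t^{-1}G)C_t\boldsymbol{\mu},
\end{aligned}$$

so that

$$C_{t+1} = (F - \Theta_t\Delta_t^{-1}G)C_t \tag{9.5.7}$$

with $C_1$ equal to the identity matrix. The quadratic form in the likelihood (9.5.2) is therefore

$$S(\boldsymbol{\mu}, Q, \sigma_w^2) = \sum_{t=1}^{n}\frac{(Y_t - G\hat{\mathbf{X}}_t)^2}{\Delta_t} \tag{9.5.8}$$

$$= \sum_{t=1}^{n}\frac{(Y_t - G\hat{\mathbf{X}}_t^* - GC_t\boldsymbol{\mu})^2}{\Delta_t}. \tag{9.5.9}$$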


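Now let $Q^* := \sigma_w^{-2}Q$ and define $L^*$ to be the likelihood function with this new parameterization, i.e., $L^*(\boldsymbol{\mu}, Q^*, \sigma_w^2) = L(\boldsymbol{\mu}, \sigma_w^2Q^*, \sigma_w^2)$. Writing $\Delta_t^* = \sigma_w^{-2}\Delta_t$ and $\Omega_t^* = \sigma_w^{-2}\Omega_t$, we see that the predictors $\hat{\mathbf{X}}_t^*$ and the matrices $C_t$ in (9.5.7) depend on the parameters only through $Q^*$. Thus,

$$S(\boldsymbol{\mu}, Q, \sigma_w^2) = \sigma_w^{-2}S(\boldsymbol{\mu}, Q^*, 1),$$

so that

$$\begin{aligned}
-2\ln L^*(\boldsymbol{\mu}, Q^*, \sigma_w^2) &= n\ln(2\pi) + \sum_{t=1}^{n}\ln\Delta_t + \sigma_w^{-2}S(\boldsymbol{\mu}, Q^*, 1) \\
&= n\ln(2\pi) + \sum_{t=1}^{n}\ln\Delta_t^* + n\ln\sigma_w^2 + \sigma_w^{-2}S(\boldsymbol{\mu}, Q^*, 1).
\end{aligned}$$

For $Q^*$ fixed, it is easy to show (see Problem 9.18) that this function is minimized when

$$\hat{\boldsymbol{\mu}} = \hat{\boldsymbol{\mu}}(Q^*) = \left[\sum_{t=1}^{n}\frac{C_t'G'GC_t}{\Delta_t^*}\right]^{-1}\sum_{t=1}^{n}\frac{C_t'G'(Y_t - G\hat{\mathbf{X}}_t^*)}{\Delta_t^*} \tag{9.5.10}$$

and

$$\hat{\sigma}_w^2 = \hat{\sigma}_w^2(Q^*) = n^{-1}\sum_{t=1}^{n}\frac{(Y_t - G\hat{\mathbf{X}}_t^* - GC_t\hat{\boldsymbol{\mu}})^2}{\Delta_t^*}. \tag{9.5.11}$$

Replacing $\boldsymbol{\mu}$ and $\sigma_w^2$ by these values in $-2\ln L^*$ and ignoring constants, the reduced likelihood becomes

$$\ell(Q^*) = \ln\left(n^{-1}\sum_{t=1}^{n}\frac{(Y_t - G\hat{\mathbf{X}}_t^* - GC_t\hat{\boldsymbol{\mu}})^2}{\Delta_t^*}\right) + n^{-1}\sum_{t=1}^{n}\ln(\det\Delta_t^*). \tag{9.5.12}$$

If $\hat{Q}^*$ denotes the minimizer of (9.5.12), then the maximum likelihood estimators of the parameters $\boldsymbol{\mu}$, $Q$, $\sigma_w^2$ are $\hat{\boldsymbol{\mu}}$, $\hat{\sigma}_w^2\hat{Q}^*$, $\hat{\sigma}_w^2$, where $\hat{\boldsymbol{\mu}}$ and $\hat{\sigma}_w^2$ are computed from (9.5.10) and (9.5.11) with $Q^*$ replaced by $\hat{Q}^*$.

We can now summarize the steps required for computing the maximum likelihood estimators of $\boldsymbol{\mu}$, $Q$, and $\sigma_w^2$ for the model (9.5.3)-(9.5.4).


1. For a fixed $Q^*$, apply the Kalman prediction recursions with $\hat{\mathbf{X}}_1^* = \mathbf{0}$, $\Omega_1 = 0$, $Q = Q^*$, and $\sigma_w^2 = 1$ to obtain the predictors $\hat{\mathbf{X}}_t^*$. Let $\Delta_t^*$ denote the one-step prediction error variance produced by these recursions.

2. Set $\hat{\boldsymbol{\mu}} = \hat{\boldsymbol{\mu}}(Q^*) = \left[\sum_{t=1}^{n} C_t'G'GC_t/\Delta_t^*\right]^{-1}\sum_{t=1}^{n} C_t'G'(Y_t - G\hat{\mathbf{X}}_t^*)/\Delta_t^*$.

3. Let $\hat{Q}^*$ be the minimizer of (9.5.12).

4. The maximum likelihood estimators of $\boldsymbol{\mu}$, $Q$, and $\sigma_w^2$ are then given by $\hat{\boldsymbol{\mu}}$, $\hat{\sigma}_w^2\hat{Q}^*$, and $\hat{\sigma}_w^2$, respectively, where $\hat{\boldsymbol{\mu}}$ and $\hat{\sigma}_w^2$ are found from (9.5.10) and (9.5.11) evaluated at $\hat{Q}^*$.


Example 9.5.1. Random Walk Plus Noise Model


In Example 9.2.1, 100 observations were generated from the structural model

$$Y_t = M_t + W_t, \quad \{W_t\} \sim \mathrm{WN}(0, \sigma_w^2),$$

$$M_{t+1} = M_t + V_t, \quad \{V_t\} \sim \mathrm{WN}(0, \sigma_v^2),$$

with initial values $\mu = M_1 = 0$, $\sigma_w^2 = 8$, and $\sigma_v^2 = 4$. The maximum likelihood estimates of the parameters are found by first minimizing (9.5.12) with $\hat{\mu}$ given by (9.5.10). Substituting these values into (9.5.11) gives $\hat{\sigma}_w^2$. The resulting estimates are $\hat{\mu} = 0.906$, $\hat{\sigma}_v^2 = 5.351$, and $\hat{\sigma}_w^2 = 8.233$, which are in reasonably close agreement with the true values.

□


Example 9.5.2. International Airline Passengers, 1949-1960; AIRPASS.TSM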

The monthly totals of international airline passengers from January 1949 to December 1960 (Box and Jenkins 1976) are displayed in Figure 9-3. The data exhibit both a strong seasonal pattern and a nearly linear trend. Since the variability of the data $Y_1, \ldots, Y_{144}$ increases for larger values of $Y_t$, it may be appropriate to consider a logarithmic transformation of the data. For the purpose of this illustration, however, we will fit a structural model incorporating a randomly varying trend and seasonal and noise components (see Example 9.2.3) to the raw data. This model has the form

$$Y_t = G\mathbf{X}_t + W_t, \quad \{W_t\} \sim \mathrm{WN}(0, \sigma_w^2),$$

$$\mathbf{X}_{t+1} = F\mathbf{X}_t + \mathbf{V}_t, \quad \{\mathbf{V}_t\} \sim \mathrm{WN}(0, Q),$$

where $\mathbf{X}_t$ is a 13-dimensional state-vector, $F$ is the corresponding $13 \times 13$ coefficient matrix (see Example 9.2.3), $G = [1\ \ 0\ \ 1\ \ 0\ \cdots\ 0]$, and $Q = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \sigma_3^2, 0, \ldots, 0)$.

Figure 9-3. International airline passengers; monthly totals from January 1949 to December 1960.

The parameters of the model are $\boldsymbol{\mu}$, $\sigma_1^2$, $\sigma_2^2$, $\sigma_3^2$, and $\sigma_w^2$, where $\boldsymbol{\mu} = \mathbf{X}_1$. Minimizing (9.5.12) with respect to $Q^*$ we find from (9.5.11) and (9.5.12) that

$$(\hat{\sigma}_1^2, \hat{\sigma}_2^2, \hat{\sigma}_3^2, \hat{\sigma}_w^2) = (170.63,\ .00000,\ 11.338,\ .014179)$$

and from (9.5.10) that $\hat{\boldsymbol{\mu}} = (146.9, 2.171, -34.92, -34.12, -47.00, -16.98, 22.99, 53.99, 58.34, 33.65, 2.204, -4.053, -6.894)'$. The first component, $X_{t1}$, of the state vector corresponds to the local linear trend with slope $X_{t2}$. Since $\hat{\sigma}_2^2 = 0$, the slope at time $t$, which satisfies

$$X_{t2} = X_{t-1,2} + V_{t2},$$

must be nearly constant and equal to $\hat{X}_{12} = 2.171$. The first three components of the predictors $\hat{\mathbf{X}}_t$ are plotted in Figure 9-4. Notice that the first component varies like a random walk around a straight line, while the second component is nearly constant as a result of $\hat{\sigma}_2^2 \approx 0$. The third component, corresponding to the seasonal component, exhibits a clear seasonal cycle that repeats roughly the same pattern throughout the 12 years of data. The one-step predictors $\hat{X}_{t1} + \hat{X}_{t3}$ of $Y_t$ are plotted in Figure 9-5 (solid line) together with the actual data (square boxes). For this model the predictors follow the movement of the data quite well.

Figure 9-5. The one-step predictors $\hat{Y}_t$ for the airline passenger data (solid line) and the actual data (square boxes).

□ 
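
The four-step procedure of Section 9.5.1 can be sketched for the scalar random walk plus noise model of Example 9.5.1, where $F = G = 1$ and $Q^*$ is the single ratio $\sigma_v^2/\sigma_w^2$. The code below is an illustration only: the function name `reduced_likelihood`, the short data vector, and the coarse grid search (standing in for a proper nonlinear optimizer) are assumptions and do not reproduce the computations reported in the examples.

```python
from math import log

def reduced_likelihood(y, q_star):
    """Return (ell(Q*), mu_hat, sigma_w2_hat) from (9.5.10)-(9.5.12) with F = G = 1."""
    n = len(y)
    x_star, omega, c = 0.0, 0.0, 1.0     # X*_1 hat = 0, Omega_1 = 0, C_1 = 1
    resid, cs, deltas = [], [], []
    for obs in y:
        delta = omega + 1.0              # Delta*_t, computed with sigma_w^2 = 1
        gain = omega / delta             # Theta_t / Delta_t
        resid.append(obs - x_star)       # Y_t - G X*_t hat
        cs.append(c)
        deltas.append(delta)
        x_star = x_star + gain * (obs - x_star)        # (9.5.5)
        c = (1.0 - gain) * c                           # (9.5.7)
        omega = omega + q_star - omega ** 2 / delta    # (9.4.2) with Q = Q*
    mu_hat = (sum(ci * ri / di for ci, ri, di in zip(cs, resid, deltas))
              / sum(ci * ci / di for ci, di in zip(cs, deltas)))        # (9.5.10)
    sigma_w2_hat = sum((ri - ci * mu_hat) ** 2 / di
                       for ci, ri, di in zip(cs, resid, deltas)) / n    # (9.5.11)
    ell = log(sigma_w2_hat) + sum(log(di) for di in deltas) / n         # (9.5.12)
    return ell, mu_hat, sigma_w2_hat

# Hypothetical data and a crude grid search over Q*.
data = [1.0, 2.5, 1.8, 3.9, 3.1, 4.4, 5.2, 4.0]
q_grid = [0.0, 0.1, 0.25, 0.5, 1.0, 2.0, 4.0]
q_hat = min(q_grid, key=lambda q: reduced_likelihood(data, q)[0])
ell_hat, mu_hat, sigma_w2_hat = reduced_likelihood(data, q_hat)
sigma_v2_hat = sigma_w2_hat * q_hat      # estimate of Q is sigma_w^2 hat times Q* hat
```

When `q_star` is zero the state never moves, every $C_t$ equals 1, and the estimates reduce to the sample mean and the (divisor $n$) sample variance, which gives a simple sanity check of the recursion.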


9.6  State-Space  Models  with  Missing  Observations 

State-space representations and the associated Kalman recursions are ideally suited to the analysis of data with missing values, as was pointed out by Jones (1980) in the context of maximum likelihood estimation for ARMA processes. In this section we shall deal with two missing-value problems for state-space models. The first is the evaluation of the (Gaussian) likelihood based on $\{\mathbf{Y}_{i_1}, \ldots, \mathbf{Y}_{i_r}\}$, where $i_1, i_2, \ldots, i_r$ are positive integers such that $1 \le i_1 < i_2 < \cdots < i_r \le n$. (This allows for observation of the process $\{\mathbf{Y}_t\}$ at irregular intervals, or equivalently for the possibility that $(n - r)$ observations are missing from the sequence $\{\mathbf{Y}_1, \ldots, \mathbf{Y}_n\}$.) The solution of this problem will, in particular, enable us to carry out maximum likelihood estimation for ARMA and ARIMA processes with missing values. The second problem to be considered is the minimum mean squared error estimation of the missing values themselves.


9.6.1 The Gaussian Likelihood of $\{\mathbf{Y}_{i_1}, \ldots, \mathbf{Y}_{i_r}\}$, $1 \le i_1 < i_2 < \cdots < i_r \le n$


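Consider the state-space model defined by equations (9.1.1) and (9.1.2) and suppose that the model is completely parameterized by the components of the vector $\theta$. If there are no missing observations, i.e., if $r = n$ and $i_j = j$, $j = 1, \ldots, n$, then the likelihood of the observations $\{\mathbf{Y}_1, \ldots, \mathbf{Y}_n\}$ is easily found as in Section 9.5 to be

$$L(\theta; \mathbf{Y}_1, \ldots, \mathbf{Y}_n) = (2\pi)^{-nw/2}\left(\prod_{j=1}^{n}\det\Delta_j\right)^{-1/2}\exp\left[-\frac{1}{2}\sum_{j=1}^{n}\mathbf{I}_j'\Delta_j^{-1}\mathbf{I}_j\right],$$

where $\mathbf{I}_j = \mathbf{Y}_j - P_{j-1}\mathbf{Y}_j$, and $P_{j-1}\mathbf{Y}_j$ and $\Delta_j$, $j \ge 1$, are the one-step predictors and error covariance matrices found from (9.4.7) and (9.4.9) with $\mathbf{Y}_0 = 1$.

To deal with the more general case of possibly irregularly spaced observations $\{\mathbf{Y}_{i_1}, \ldots, \mathbf{Y}_{i_r}\}$, we introduce a new series $\{\mathbf{Y}_t^*\}$, related to the process $\{\mathbf{X}_t\}$ by the modified observation equation

$$\mathbf{Y}_t^* = G_t^*\mathbf{X}_t + \mathbf{W}_t^*, \quad t = 1, 2, \ldots, \tag{9.6.1}$$

where

$$G_t^* = \begin{cases} G_t & \text{if } t \in \{i_1, \ldots, i_r\}, \\ 0 & \text{otherwise,} \end{cases} \qquad \mathbf{W}_t^* = \begin{cases} \mathbf{W}_t & \text{if } t \in \{i_1, \ldots, i_r\}, \\ \mathbf{N}_t & \text{otherwise,} \end{cases} \tag{9.6.2}$$

and $\{\mathbf{N}_t\}$ is iid with

$$\mathbf{N}_t \sim \mathrm{N}(\mathbf{0}, I_{w \times w}), \quad \mathbf{N}_s \perp \mathbf{X}_1, \quad \mathbf{N}_s \perp \begin{bmatrix} \mathbf{V}_t \\ \mathbf{W}_t \end{bmatrix}, \quad s, t = 0, \pm 1, \ldots. \tag{9.6.3}$$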

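Equations (9.6.1) and (9.1.2) constitute a state-space representation for the new series $\{\mathbf{Y}_t^*\}$, which coincides with $\{\mathbf{Y}_t\}$ at each $t \in \{i_1, i_2, \ldots, i_r\}$, and at other times takes random values that are independent of $\{\mathbf{Y}_t\}$ with a distribution independent of $\theta$.

Let $L_1(\theta; \mathbf{y}_{i_1}, \ldots, \mathbf{y}_{i_r})$ be the Gaussian likelihood based on the observed values $\mathbf{y}_{i_1}, \ldots, \mathbf{y}_{i_r}$ of $\mathbf{Y}_{i_1}, \ldots, \mathbf{Y}_{i_r}$ under the model defined by (9.1.1) and (9.1.2). Corresponding to these observed values, we define a new sequence, $\mathbf{y}_1^*, \ldots, \mathbf{y}_n^*$, by

$$\mathbf{y}_t^* = \begin{cases} \mathbf{y}_t & \text{if } t \in \{i_1, \ldots, i_r\}, \\ \mathbf{0} & \text{otherwise.} \end{cases} \tag{9.6.4}$$

Then it is clear from the preceding paragraph that

$$L_1(\theta; \mathbf{y}_{i_1}, \ldots, \mathbf{y}_{i_r}) = (2\pi)^{(n-r)w/2}L_2(\theta; \mathbf{y}_1^*, \ldots, \mathbf{y}_n^*), \tag{9.6.5}$$

where $L_2$ denotes the Gaussian likelihood under the model defined by (9.6.1) and (9.1.2).

In view of (9.6.5) we can now compute the required likelihood $L_1$ of the realized values $\{\mathbf{y}_t, t = i_1, \ldots, i_r\}$ as follows:


i. Define the sequence $\{\mathbf{y}_t^*, t = 1, \ldots, n\}$ as in (9.6.4).

ii. Find the one-step predictors $\hat{\mathbf{Y}}_t^*$ of $\mathbf{Y}_t^*$, and their error covariance matrices $\Delta_t^*$, using Kalman prediction and equations (9.4.7) and (9.4.9) applied to the state-space representation (9.6.1) and (9.1.2) of $\{\mathbf{Y}_t^*\}$. Denote the realized values of the predictors, based on the observation sequence $\{\mathbf{y}_t^*\}$, by $\{\hat{\mathbf{y}}_t^*\}$.

iii. The required Gaussian likelihood of the irregularly spaced observations $\{\mathbf{y}_{i_1}, \ldots, \mathbf{y}_{i_r}\}$ is then, by (9.6.5),

$$L_1(\theta; \mathbf{y}_{i_1}, \ldots, \mathbf{y}_{i_r}) = (2\pi)^{-rw/2}\left(\prod_{j=1}^{n}\det\Delta_j^*\right)^{-1/2}\exp\left\{-\frac{1}{2}\sum_{j=1}^{n}\mathbf{i}_j^{*\prime}(\Delta_j^*)^{-1}\mathbf{i}_j^*\right\},$$

where $\mathbf{i}_j^*$ denotes the observed innovation $\mathbf{y}_j^* - \hat{\mathbf{y}}_j^*$, $j = 1, \ldots, n$.


Example 9.6.1. An AR(1) Series with One Missing Observation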


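Let $\{Y_t\}$ be the causal AR(1) process defined by

$$Y_t - \phi Y_{t-1} = Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2).$$

To find the Gaussian likelihood of the observations $y_1$, $y_3$, $y_4$, and $y_5$ of $Y_1$, $Y_3$, $Y_4$, and $Y_5$ we follow the steps outlined above.


i. Set $y_i^* = y_i$, $i = 1, 3, 4, 5$ and $y_2^* = 0$.

ii. We start with the state-space model for $\{Y_t\}$ from Example 9.1.1, i.e., $Y_t = X_t$, $X_{t+1} = \phi X_t + Z_{t+1}$. The corresponding model for $\{Y_t^*\}$ is then, from (9.6.1),

$$X_{t+1} = F_tX_t + V_t, \quad t = 1, 2, \ldots,$$

$$Y_t^* = G_t^*X_t + W_t^*, \quad t = 1, 2, \ldots,$$

where

$$F_t = \phi, \quad G_t^* = \begin{cases} 1 & \text{if } t \ne 2, \\ 0 & \text{if } t = 2, \end{cases} \quad V_t = Z_{t+1}, \quad W_t^* = \begin{cases} 0 & \text{if } t \ne 2, \\ N_t & \text{if } t = 2, \end{cases}$$

$$Q_t = \sigma^2, \quad R_t^* = \begin{cases} 0 & \text{if } t \ne 2, \\ 1 & \text{if } t = 2, \end{cases} \quad S_t^* = 0,$$

and $X_1 = \sum_{j=0}^{\infty}\phi^jZ_{1-j}$. Starting from the initial conditions

$$\hat{X}_1 = 0, \quad \Omega_1 = \sigma^2/(1 - \phi^2),$$

and applying the recursions (9.4.1) and (9.4.2), we find (Problem 9.19) that

$$\Theta_t\Delta_t^{-1} = \begin{cases} \phi & \text{if } t = 1, 3, 4, 5, \\ 0 & \text{if } t = 2, \end{cases} \qquad \Omega_t = \begin{cases} \sigma^2/(1 - \phi^2) & \text{if } t = 1, \\ \sigma^2(1 + \phi^2) & \text{if } t = 3, \\ \sigma^2 & \text{if } t = 2, 4, 5, \end{cases}$$

and

$$\hat{X}_1 = 0, \quad \hat{X}_2 = \phi Y_1, \quad \hat{X}_3 = \phi^2Y_1, \quad \hat{X}_4 = \phi Y_3, \quad \hat{X}_5 = \phi Y_4.$$

From (9.4.7) and (9.4.9) with $h = 1$, we find that

$$\hat{Y}_1^* = 0, \quad \hat{Y}_2^* = 0, \quad \hat{Y}_3^* = \phi^2Y_1, \quad \hat{Y}_4^* = \phi Y_3, \quad \hat{Y}_5^* = \phi Y_4,$$

with corresponding mean squared errors

$$\Delta_1^* = \sigma^2/(1 - \phi^2), \quad \Delta_2^* = 1, \quad \Delta_3^* = \sigma^2(1 + \phi^2), \quad \Delta_4^* = \sigma^2, \quad \Delta_5^* = \sigma^2.$$

iii. From the preceding calculations we can now write the likelihood of the original data as

$$L_1(\phi, \sigma^2; y_1, y_3, y_4, y_5) = \sigma^{-4}(2\pi)^{-2}\left[\frac{1 - \phi^2}{1 + \phi^2}\right]^{1/2}\exp\left\{-\frac{1}{2\sigma^2}\left[y_1^2(1 - \phi^2) + \frac{(y_3 - \phi^2y_1)^2}{1 + \phi^2} + (y_4 - \phi y_3)^2 + (y_5 - \phi y_4)^2\right]\right\}.$$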


□ 
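The calculation in Example 9.6.1 can be checked numerically. The following is a minimal sketch, not the ITSM implementation: it runs the scalar prediction recursions for an AR(1) series, treating a missing value (coded here as `None`, an assumption of this sketch) as an observation with $G_t^* = 0$ and $R_t^* = 1$, and compares the result with the closed-form likelihood above. The function names are hypothetical.

```python
import math


def ar1_missing_loglik(y, phi, sigma2):
    """Gaussian log-likelihood of an AR(1) series; missing values are None.

    A missing Y_t is handled as in steps i-iii: its innovation is 0 and its
    innovation variance is 1, so it adds nothing to the sums below, and the
    power of 2*pi counts only the observed values.
    """
    x_hat = 0.0                        # one-step predictor of X_1
    omega = sigma2 / (1.0 - phi ** 2)  # its mean squared error Omega_1
    n_obs, log_det, quad = 0, 0.0, 0.0
    for obs in y:
        if obs is None:
            # G* = 0: no update from the data, so Theta * Delta^{-1} = 0.
            x_hat = phi * x_hat
            omega = phi ** 2 * omega + sigma2
        else:
            delta = omega              # Delta*_t = Omega_t when G = 1, R = 0
            innov = obs - x_hat
            log_det += math.log(delta)
            quad += innov ** 2 / delta
            n_obs += 1
            x_hat = phi * x_hat + phi * innov  # equals phi * obs
            omega = sigma2
    return -0.5 * (n_obs * math.log(2.0 * math.pi) + log_det + quad)


def closed_form_loglik(y1, y3, y4, y5, phi, sigma2):
    """Logarithm of the likelihood L_1 displayed at the end of the example."""
    quad = (y1 ** 2 * (1.0 - phi ** 2)
            + (y3 - phi ** 2 * y1) ** 2 / (1.0 + phi ** 2)
            + (y4 - phi * y3) ** 2
            + (y5 - phi * y4) ** 2)
    return (-2.0 * math.log(sigma2) - 2.0 * math.log(2.0 * math.pi)
            + 0.5 * math.log((1.0 - phi ** 2) / (1.0 + phi ** 2))
            - quad / (2.0 * sigma2))
```

For any admissible inputs, for example `ar1_missing_loglik([0.5, None, -0.3, 1.2, 0.7], 0.6, 2.0)` and `closed_form_loglik(0.5, -0.3, 1.2, 0.7, 0.6, 2.0)`, the two functions should agree up to rounding error.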


Remark 1. If we are given observations $y_{1-d}, y_{2-d}, \ldots, y_0, y_{i_1}, y_{i_2}, \ldots, y_{i_r}$ of an ARIMA($p, d, q$) process at times $1 - d, 2 - d, \ldots, 0, i_1, \ldots, i_r$, where $1 \le i_1 < i_2 < \cdots < i_r \le n$, a similar argument can be used to find the Gaussian likelihood of $y_{i_1}, \ldots, y_{i_r}$ conditional on $Y_{1-d} = y_{1-d}, Y_{2-d} = y_{2-d}, \ldots, Y_0 = y_0$. Missing values among the first $d$ observations $y_{1-d}, y_{2-d}, \ldots, y_0$ can be handled by treating them as unknown parameters for likelihood maximization. For more on ARIMA series with missing values see Brockwell and Davis (1991) and Ansley and Kohn (1985). □


9.6.2  Estimation  of  Missing  Values  for  State-Space  Models 


Given that we observe only $Y_{i_1}, Y_{i_2}, \ldots, Y_{i_r}$, $1 \le i_1 < i_2 < \cdots < i_r \le n$, where $\{Y_t\}$ has the state-space representation (9.1.1) and (9.1.2), we now consider the problem of finding the minimum mean squared error estimators $P(Y_t | Y_0, Y_{i_1}, \ldots, Y_{i_r})$ of $Y_t$, $1 \le t \le n$, where $Y_0 = 1$. To handle this problem we again use the modified process $\{Y_t^*\}$ defined by (9.6.1) and (9.1.2) with $Y_0^* = 1$. Since $Y_s^* = Y_s$ for $s \in \{i_1, \ldots, i_r\}$ and $Y_s^* \perp X_t, Y_0$ for $1 \le t \le n$ and $s \notin \{0, i_1, \ldots, i_r\}$, we immediately obtain the minimum mean squared error state estimators

$$P(X_t | Y_0, Y_{i_1}, \ldots, Y_{i_r}) = P(X_t | Y_0^*, Y_1^*, \ldots, Y_n^*), \quad 1 \le t \le n. \tag{9.6.6}$$

The right-hand side can be evaluated by application of the Kalman fixed-point smoothing algorithm to the state-space model (9.6.1) and (9.1.2). For computational purposes the observed values of $Y_t^*$, $t \notin \{0, i_1, \ldots, i_r\}$, are quite immaterial. They may, for example, all be set equal to zero, giving the sequence of observations of $Y_t^*$ defined in (9.6.4).

To evaluate $P(Y_t | Y_0, Y_{i_1}, \ldots, Y_{i_r})$, $1 \le t \le n$, we use (9.6.6) and the relation

$$Y_t = G_t X_t + W_t. \tag{9.6.7}$$

Since $E(V_t W_t') = S_t = 0$, $t = 1, \ldots, n$, we find from (9.6.7) that

$$P(Y_t | Y_0, Y_{i_1}, \ldots, Y_{i_r}) = G_t P(X_t | Y_0^*, Y_1^*, \ldots, Y_n^*). \tag{9.6.8}$$

Example  9.6.2.  An  AR(1)  Series  with  One  Missing  Observation 

Consider the problem of estimating the missing value $Y_2$ in Example 9.6.1 in terms of $Y_0 = 1$, $Y_1$, $Y_3$, $Y_4$, and $Y_5$. We start from the state-space model $X_{t+1} = \phi X_t + Z_{t+1}$, $Y_t = X_t$, for $\{Y_t\}$. The corresponding model for $\{Y_t^*\}$ is the one used in Example 9.6.1. Applying the Kalman smoothing equations to the latter model, we find that

$$P_1 X_2 = \phi Y_1, \quad P_2 X_2 = \phi Y_1, \quad P_3 X_2 = \frac{\phi (Y_1 + Y_3)}{1 + \phi^2}, \quad P_4 X_2 = P_3 X_2, \quad P_5 X_2 = P_3 X_2,$$

and

$$\Omega_{2,t} = 0, \quad t \ge 4, \qquad \Omega_{2|t} = \frac{\sigma^2}{1 + \phi^2}, \quad t \ge 3,$$

where $P_t(\cdot)$ here denotes $P(\cdot | Y_0^*, \ldots, Y_t^*)$ and $\Omega_{t,n}$, $\Omega_{t|n}$ are defined correspondingly. We deduce from (9.6.8) that the minimum mean squared error estimator of the missing value $Y_2$ is

$$P_5 Y_2 = P_5 X_2 = \frac{\phi (Y_1 + Y_3)}{1 + \phi^2},$$

with mean squared error

$$\Omega_{2|5} = \frac{\sigma^2}{1 + \phi^2}.$$

□
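As a cross-check on Example 9.6.2 that avoids the Kalman smoother, the best linear predictor of $Y_2$ in terms of $Y_1$ and $Y_3$ can be computed directly from the AR(1) autocovariance function $\gamma(h) = \sigma^2 \phi^{|h|} / (1 - \phi^2)$. The sketch below assumes (as follows from the Markov property of the AR(1) process) that $Y_4$ and $Y_5$ carry no further information about $Y_2$ once $Y_3$ is known; the function names are hypothetical.

```python
def ar1_acvf(h, phi, sigma2):
    """Autocovariance of a causal AR(1) process at lag h."""
    return sigma2 * phi ** abs(h) / (1.0 - phi ** 2)


def interpolate_y2(phi, sigma2):
    """Coefficients (a1, a3) and MSE of the best linear predictor
    a1*Y1 + a3*Y3 of Y2, from the 2x2 normal equations."""
    g0 = ar1_acvf(0, phi, sigma2)
    g1 = ar1_acvf(1, phi, sigma2)
    g2 = ar1_acvf(2, phi, sigma2)
    # Solve [[g0, g2], [g2, g0]] (a1, a3)' = (g1, g1)' by Cramer's rule.
    det = g0 * g0 - g2 * g2
    a1 = (g1 * g0 - g2 * g1) / det
    a3 = (g0 * g1 - g2 * g1) / det
    mse = g0 - a1 * g1 - a3 * g1
    return a1, a3, mse
```

Both coefficients should equal $\phi / (1 + \phi^2)$ and the mean squared error should equal $\sigma^2 / (1 + \phi^2)$, in agreement with the smoother output above.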


Remark 2. Suppose we have observations $Y_{1-d}, Y_{2-d}, \ldots, Y_0, Y_{i_1}, \ldots, Y_{i_r}$ ($1 \le i_1 < i_2 < \cdots < i_r \le n$) of an ARIMA($p, d, q$) process. Determination of the best linear estimates of the missing values $Y_t$, $t \notin \{i_1, \ldots, i_r\}$, in terms of $Y_t$, $t \in \{i_1, \ldots, i_r\}$, and the components of $\mathbf{Y}_0 := (Y_{1-d}, Y_{2-d}, \ldots, Y_0)'$ can be carried out as in Example 9.6.2 using the state-space representation of the ARIMA series $\{Y_t\}$ from Example 9.3.3 and the Kalman recursions for the corresponding state-space model for $\{Y_t^*\}$ defined by (9.6.1) and (9.1.2). See Brockwell and Davis (1991) for further details. □


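We close this section with a brief discussion of a direct approach to estimating missing observations. This approach is often more efficient than the methods just described, especially if the number of missing observations is small and we have a simple (e.g., autoregressive) model. Consider the general problem of computing $E(\mathbf{X}|\mathbf{Y})$ when the random vector $(\mathbf{X}', \mathbf{Y}')'$ has a multivariate normal distribution with mean $\mathbf{0}$ and covariance matrix $\Sigma$. (In the missing observation problem, think of $\mathbf{X}$ as the vector of the missing observations and $\mathbf{Y}$ as the vector of observed values.) Then the joint probability density function of $\mathbf{X}$ and $\mathbf{Y}$ can be written as

$$f_{\mathbf{X},\mathbf{Y}}(\mathbf{x}, \mathbf{y}) = f_{\mathbf{X}|\mathbf{Y}}(\mathbf{x}|\mathbf{y}) f_{\mathbf{Y}}(\mathbf{y}), \tag{9.6.9}$$

where $f_{\mathbf{X}|\mathbf{Y}}(\mathbf{x}|\mathbf{y})$ is a multivariate normal density with mean $E(\mathbf{X}|\mathbf{Y})$ and covariance matrix $\Sigma_{\mathbf{X}|\mathbf{Y}}$ (see Proposition A.3.1). In particular,

$$f_{\mathbf{X}|\mathbf{Y}}(\mathbf{x}|\mathbf{y}) = \frac{1}{\sqrt{(2\pi)^q \det \Sigma_{\mathbf{X}|\mathbf{Y}}}} \exp\left\{ -\frac{1}{2} (\mathbf{x} - E(\mathbf{X}|\mathbf{y}))' \Sigma_{\mathbf{X}|\mathbf{Y}}^{-1} (\mathbf{x} - E(\mathbf{X}|\mathbf{y})) \right\}, \tag{9.6.10}$$

where $q = \dim(\mathbf{X})$. It is clear from (9.6.10) that $f_{\mathbf{X}|\mathbf{Y}}(\mathbf{x}|\mathbf{y})$ (and also $f_{\mathbf{X},\mathbf{Y}}(\mathbf{x}, \mathbf{y})$) is maximum when $\mathbf{x} = E(\mathbf{X}|\mathbf{y})$. Thus, the best estimator of $\mathbf{X}$ in terms of $\mathbf{Y}$ can be found by maximizing the joint density of $\mathbf{X}$ and $\mathbf{Y}$ with respect to $\mathbf{x}$. For autoregressive processes it is relatively straightforward to carry out this optimization, as shown in the following example.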




Example  9.6.3.  Estimating  Missing  Observations  in  an  AR  Process 

Suppose $\{Y_t\}$ is the AR($p$) process defined by

$$Y_t = \phi_1 Y_{t-1} + \cdots + \phi_p Y_{t-p} + Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2),$$

and $\mathbf{Y} = (Y_{i_1}, \ldots, Y_{i_r})'$, with $1 \le i_1 < \cdots < i_r \le n$, are the observed values. If there are no missing observations in the first $p$ observations, then the best estimates of the missing values are found by minimizing

$$\sum_{t=p+1}^{n} (Y_t - \phi_1 Y_{t-1} - \cdots - \phi_p Y_{t-p})^2 \tag{9.6.11}$$

with respect to the missing values (see Problem 9.20). For the AR(1) model in Example 9.6.2, minimization of (9.6.11) is equivalent to minimizing

$$(Y_2 - \phi Y_1)^2 + (Y_3 - \phi Y_2)^2$$

with respect to $Y_2$. Setting the derivative of this expression with respect to $Y_2$ equal to 0 and solving for $Y_2$ we obtain $E(Y_2 | Y_1, Y_3, Y_4, Y_5) = \phi (Y_1 + Y_3) / (1 + \phi^2)$.

□
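The minimization of (9.6.11) is easy to carry out when a single value is missing, because every residual that involves the missing value is linear in it. The sketch below is an illustration under that single-missing-value assumption; the function name and the 0-based indexing are choices made here, not part of the text.

```python
def fill_single_missing(y, m, phi):
    """Least-squares estimate of y[m] for an AR(p) series (0-based index m).

    Minimizes the sum over t of (y[t] - sum_k phi[k-1] * y[t-k])**2 with
    respect to y[m] alone. Assumes m >= len(phi) and that every other value
    of y is observed; the entry stored in y[m] is ignored.
    """
    p = len(phi)
    n = len(y)
    work = list(y)
    work[m] = 0.0  # residuals evaluated at y[m] = 0 give the constants d_t
    num = 0.0
    den = 0.0
    # Only the residuals at times m, m+1, ..., m+p depend on y[m].
    for t in range(m, min(m + p, n - 1) + 1):
        # Residual at time t has the form c * y[m] + d.
        c = 1.0 if t == m else -phi[t - m - 1]
        d = work[t] - sum(phi[k - 1] * work[t - k] for k in range(1, p + 1))
        num += c * d
        den += c * c
    return -num / den
```

For an AR(1) series with $y_1 = 1$, $y_3 = 2$ and $\phi = 0.5$, the call `fill_single_missing([1.0, float("nan"), 2.0], 1, [0.5])` should reproduce $\phi (y_1 + y_3) / (1 + \phi^2) = 1.2$.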


9.7  The  EM  Algorithm 

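The expectation-maximization (EM) algorithm is an iterative procedure for computing the maximum likelihood estimator when only a subset of the complete data set is available. Dempster et al. (1977) demonstrated the wide applicability of the EM algorithm and are largely responsible for popularizing this method in statistics. Details regarding the convergence and performance of the EM algorithm can be found in Wu (1983).

In the usual formulation of the EM algorithm, the "complete" data vector $\mathbf{W}$ is made up of "observed" data $\mathbf{Y}$ (sometimes called incomplete data) and "unobserved" data $\mathbf{X}$. In many applications, $\mathbf{X}$ consists of values of a "latent" or unobserved process occurring in the specification of the model. For example, in the state-space model of Section 9.1, $\mathbf{Y}$ could consist of the observed vectors $\mathbf{Y}_1, \ldots, \mathbf{Y}_n$ and $\mathbf{X}$ of the unobserved state vectors $\mathbf{X}_1, \ldots, \mathbf{X}_n$. The EM algorithm provides an iterative procedure for computing the maximum likelihood estimator based only on the observed data $\mathbf{Y}$. Each iteration of the EM algorithm consists of two steps. If $\theta^{(i)}$ denotes the estimated value of the parameter $\theta$ after $i$ iterations, then the two steps in the $(i + 1)$th iteration are

E-step. Calculate $Q(\theta | \theta^{(i)}) = E_{\theta^{(i)}} [\ell(\theta; \mathbf{X}, \mathbf{Y}) | \mathbf{Y}]$

and

M-step. Maximize $Q(\theta | \theta^{(i)})$ with respect to $\theta$.

Then $\theta^{(i+1)}$ is set equal to the maximizer of $Q$ in the M-step. In the E-step, $\ell(\theta; \mathbf{x}, \mathbf{y}) = \ln f(\mathbf{x}, \mathbf{y}; \theta)$, and $E_{\theta^{(i)}}(\cdot | \mathbf{Y})$ denotes the conditional expectation relative to the conditional density $f(\mathbf{x} | \mathbf{y}; \theta^{(i)}) = f(\mathbf{x}, \mathbf{y}; \theta^{(i)}) / f(\mathbf{y}; \theta^{(i)})$.

It can be shown that $\ell(\theta^{(i)}; \mathbf{Y})$ is nondecreasing in $i$, and a simple heuristic argument shows that if $\theta^{(i)}$ has a limit $\hat{\theta}$ then $\hat{\theta}$ must be a solution of the likelihood equations $\ell'(\hat{\theta}; \mathbf{Y}) = 0$. To see this, observe that $\ln f(\mathbf{x}, \mathbf{y}; \theta) = \ln f(\mathbf{x} | \mathbf{y}; \theta) + \ell(\theta; \mathbf{y})$, from which we obtain

$$Q(\theta | \theta^{(i)}) = \int (\ln f(\mathbf{x} | \mathbf{Y}; \theta)) f(\mathbf{x} | \mathbf{Y}; \theta^{(i)}) \, d\mathbf{x} + \ell(\theta; \mathbf{Y})$$

and

$$Q'(\theta | \theta^{(i)}) = \int \left[ \frac{\partial}{\partial \theta} f(\mathbf{x} | \mathbf{Y}; \theta) \right] \Big/ f(\mathbf{x} | \mathbf{Y}; \theta) \; f(\mathbf{x} | \mathbf{Y}; \theta^{(i)}) \, d\mathbf{x} + \ell'(\theta; \mathbf{Y}).$$

Now replacing $\theta$ with $\theta^{(i+1)}$, noticing that $Q'(\theta^{(i+1)} | \theta^{(i)}) = 0$, and letting $i \to \infty$, we find that

$$0 = \int \frac{\partial}{\partial \theta} \left[ f(\mathbf{x} | \mathbf{Y}; \theta) \right]_{\theta = \hat{\theta}} d\mathbf{x} + \ell'(\hat{\theta}; \mathbf{Y}) = \ell'(\hat{\theta}; \mathbf{Y}).$$

The last equality follows from the fact that

$$0 = \frac{\partial}{\partial \theta} (1) = \frac{\partial}{\partial \theta} \left( \int f(\mathbf{x} | \mathbf{Y}; \theta) \, d\mathbf{x} \right)_{\theta = \hat{\theta}} = \int \left[ \frac{\partial}{\partial \theta} f(\mathbf{x} | \mathbf{Y}; \theta) \right]_{\theta = \hat{\theta}} d\mathbf{x}.$$

The computational advantage of the EM algorithm over direct maximization of the likelihood is most pronounced when the calculation and maximization of the exact likelihood is difficult as compared with the maximization of $Q$ in the M-step. (There are some applications in which the maximization of $Q$ can easily be carried out explicitly.)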


9.7.1  Missing  Data 

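The EM algorithm is particularly useful for estimation problems in which there are missing observations. Suppose the complete data set consists of $Y_1, \ldots, Y_n$ of which $r$ are observed and $n - r$ are missing. Denote the observed and missing data by $\mathbf{Y} = (Y_{i_1}, \ldots, Y_{i_r})'$ and $\mathbf{X} = (Y_{j_1}, \ldots, Y_{j_{n-r}})'$, respectively. Assuming that $\mathbf{W} = (\mathbf{X}', \mathbf{Y}')'$ has a multivariate normal distribution with mean $\mathbf{0}$ and covariance matrix $\Sigma$, which depends on the parameter $\theta$, the log-likelihood of the complete data is given by

$$\ell(\theta; \mathbf{W}) = -\frac{n}{2} \ln(2\pi) - \frac{1}{2} \ln \det(\Sigma) - \frac{1}{2} \mathbf{W}' \Sigma^{-1} \mathbf{W}.$$

The E-step requires that we compute the expectation of $\ell(\theta; \mathbf{W})$ with respect to the conditional distribution of $\mathbf{W}$ given $\mathbf{Y}$ with $\theta = \theta^{(i)}$. Writing $\Sigma(\theta)$ as the block matrix

$$\Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix},$$

which is conformable with $\mathbf{X}$ and $\mathbf{Y}$, the conditional distribution of $\mathbf{W}$ given $\mathbf{Y}$ is multivariate normal with mean $\begin{bmatrix} \hat{\mathbf{X}} \\ \mathbf{Y} \end{bmatrix}$ and covariance matrix $\begin{bmatrix} \Sigma_{11|2}(\theta) & 0 \\ 0 & 0 \end{bmatrix}$, where $\hat{\mathbf{X}} = E_\theta(\mathbf{X} | \mathbf{Y}) = \Sigma_{12} \Sigma_{22}^{-1} \mathbf{Y}$ and $\Sigma_{11|2}(\theta) = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}$ (see Proposition A.3.1). Using Problem A.8, we have

$$E_{\theta^{(i)}} \left[ (\mathbf{X}', \mathbf{Y}') \Sigma^{-1}(\theta) (\mathbf{X}', \mathbf{Y}')' \,|\, \mathbf{Y} \right] = \mathrm{trace}\left( \Sigma_{11|2}(\theta^{(i)}) \Sigma_{11|2}^{-1}(\theta) \right) + \hat{\mathbf{W}}' \Sigma^{-1}(\theta) \hat{\mathbf{W}},$$

where $\hat{\mathbf{W}} = (\hat{\mathbf{X}}', \mathbf{Y}')'$. It follows that

$$Q(\theta | \theta^{(i)}) = \ell(\theta; \hat{\mathbf{W}}) - \frac{1}{2} \mathrm{trace}\left( \Sigma_{11|2}(\theta^{(i)}) \Sigma_{11|2}^{-1}(\theta) \right).$$

The first term on the right is the log-likelihood based on the complete data, but with $\mathbf{X}$ replaced by its "best estimate" $\hat{\mathbf{X}}$ calculated from the previous iteration. If the increments $\theta^{(i+1)} - \theta^{(i)}$ are small, then the second term on the right is nearly constant ($\approx n - r$) and can be ignored. For ease of computation in this application we shall use the modified version

$$\tilde{Q}(\theta | \theta^{(i)}) = \ell(\theta; \hat{\mathbf{W}}).$$

With this adjustment, the steps in the EM algorithm are as follows:

E-step. Calculate $E_{\theta^{(i)}}(\mathbf{X} | \mathbf{Y})$ (e.g., with the Kalman fixed-point smoother) and form $\ell(\theta; \hat{\mathbf{W}})$.

M-step. Find the maximum likelihood estimator for the "complete" data problem, i.e., maximize $\ell(\theta; \hat{\mathbf{W}})$. For ARMA processes, ITSM can be used directly, with the missing values replaced with their best estimates computed in the E-step.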

Example 9.7.1. The Lake Data

It was found in Example 5.2.5 that the AR(2) model

$$W_t - 1.0415 W_{t-1} + 0.2494 W_{t-2} = Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, 0.4790)$$

was a good fit to the mean-corrected lake data $\{W_t\}$. To illustrate the use of the EM algorithm for missing data, consider fitting an AR(2) model to the mean-corrected data assuming that there are 10 missing values at times $t = 17, 24, 31, 38, 45, 52, 59, 66, 73$, and $80$. We start the algorithm at iteration 0 with $\hat{\phi}_1^{(0)} = \hat{\phi}_2^{(0)} = 0$. Since this initial model represents white noise, the first E-step gives, in the notation used above, $\hat{W}_{17} = \cdots = \hat{W}_{80} = 0$. Replacing the "missing" values of the mean-corrected lake data with 0 and fitting a mean-zero AR(2) model to the resulting complete data set using the maximum likelihood option in ITSM, we find that $\hat{\phi}_1^{(1)} = 0.7252$, $\hat{\phi}_2^{(1)} = 0.0236$. (Examination of the plots of the ACF and PACF of this new data set suggests an AR(1) as a better model. This is also borne out by the small estimated value of $\phi_2$.) The updated missing values at times $t = 17, 24, \ldots, 80$ are found (see Section 9.6 and Problem 9.21) by minimizing

$$\sum_{j=0}^{2} \left( W_{t+j} - \hat{\phi}_1^{(1)} W_{t+j-1} - \hat{\phi}_2^{(1)} W_{t+j-2} \right)^2$$

with respect to $W_t$. The solution is given by

$$\hat{W}_t = \frac{\hat{\phi}_2^{(1)} (W_{t-2} + W_{t+2}) + \left( \hat{\phi}_1^{(1)} - \hat{\phi}_1^{(1)} \hat{\phi}_2^{(1)} \right) (W_{t-1} + W_{t+1})}{1 + \left( \hat{\phi}_1^{(1)} \right)^2 + \left( \hat{\phi}_2^{(1)} \right)^2}.$$

The M-step of iteration 1 is then carried out by fitting an AR(2) model using ITSM applied to the updated data set. As seen in the summary of the results reported in Table 9.1, the EM algorithm converges in four iterations with the final parameter estimates reasonably close to the fitted model based on the complete data set. (In Table 9.1, estimates of the missing values are recorded only for the first three.) Also notice how $-2\ell(\theta^{(i)}, \hat{\mathbf{W}})$ decreases at every iteration. The standard errors of the parameter estimates produced from the last iteration of ITSM are based on a "complete" data set and, as such, underestimate the true sampling errors. Formulae for adjusting the standard errors to reflect the true sampling error based on the observed data can be found in Dempster et al. (1977).

□
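The iteration in Example 9.7.1 can be sketched in a few lines. The code below is only an illustration under stated assumptions: the M-step uses a conditional least-squares AR(2) fit as a stand-in for the maximum likelihood option of ITSM (so its numbers will not match Table 9.1 exactly), the missing times are assumed to be at least three apart and away from the ends of the series so that the closed-form interpolation applies, and all function names are hypothetical.

```python
def fit_ar2_ls(w):
    """Conditional least-squares AR(2) fit: regress w[t] on w[t-1], w[t-2]."""
    s11 = s12 = s22 = b1 = b2 = 0.0
    for t in range(2, len(w)):
        x1, x2, y = w[t - 1], w[t - 2], w[t]
        s11 += x1 * x1
        s12 += x1 * x2
        s22 += x2 * x2
        b1 += x1 * y
        b2 += x2 * y
    det = s11 * s22 - s12 * s12
    phi1 = (b1 * s22 - s12 * b2) / det
    phi2 = (s11 * b2 - s12 * b1) / det
    return phi1, phi2


def interpolate_ar2(w, t, phi1, phi2):
    """Closed-form minimizer from the example (needs 2 <= t <= len(w) - 3)."""
    num = (phi2 * (w[t - 2] + w[t + 2])
           + (phi1 - phi1 * phi2) * (w[t - 1] + w[t + 1]))
    return num / (1.0 + phi1 ** 2 + phi2 ** 2)


def em_ar2(w, missing, n_iter=10):
    """Alternate E-steps (interpolation) and M-steps (AR(2) refit)."""
    w = list(w)
    for t in missing:
        w[t] = 0.0  # iteration 0: white-noise model, so the estimates are 0
    phi1, phi2 = fit_ar2_ls(w)
    for _ in range(n_iter):
        for t in missing:
            w[t] = interpolate_ar2(w, t, phi1, phi2)
        phi1, phi2 = fit_ar2_ls(w)
    return phi1, phi2, w
```

A call such as `em_ar2(series, [16, 23, 30], n_iter=4)` (0-based positions for $t = 17, 24, 31$) returns the final coefficient estimates together with the completed series.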


Table 9.1  Estimates of the missing observations at times t = 17, 24, 31 and the AR estimates using the EM algorithm in Example 9.7.1

Iteration $i$   $\hat{W}_{17}$   $\hat{W}_{24}$   $\hat{W}_{31}$   $\hat{\phi}_1^{(i)}$   $\hat{\phi}_2^{(i)}$   $-2\ell(\theta^{(i)}, \hat{\mathbf{W}})$
0                                                                  0                      0                      322.60
1               0                0                0                0.7252                 0.0236                 244.76
2               0.534            0.205            0.746            1.0729                 -0.2838                203.57
3               0.458            0.393            0.821            1.0999                 -0.3128                202.25
4               0.454            0.405            0.826            1.0999                 -0.3128                202.25


9.8 Generalized State-Space Models

As in Section 9.1, we consider a sequence of state variables $\{X_t, t \ge 1\}$ and a sequence of observations $\{Y_t, t \ge 1\}$. For simplicity, we consider only one-dimensional state and observation variables, since extensions to higher dimensions can be carried out with little change. Throughout this section it will be convenient to write $\mathbf{Y}^{(t)}$ and $\mathbf{X}^{(t)}$ for the $t$-dimensional column vectors $\mathbf{Y}^{(t)} = (Y_1, Y_2, \ldots, Y_t)'$ and $\mathbf{X}^{(t)} = (X_1, X_2, \ldots, X_t)'$.

There are two important types of state-space models, "parameter driven" and "observation driven," both of which are frequently used in time series analysis. The observation equation is the same for both, but the state vectors of a parameter-driven model evolve independently of the past history of the observation process, while the state vectors of an observation-driven model depend on past observations.


9.8.1  Parameter-Driven  Models 


In place of the observation and state equations (9.1.1) and (9.1.2), we now make the assumptions that $Y_t$ given $(X_t, \mathbf{X}^{(t-1)}, \mathbf{Y}^{(t-1)})$ is independent of $(\mathbf{X}^{(t-1)}, \mathbf{Y}^{(t-1)})$ with conditional probability density

$$p(y_t | x_t) := p(y_t | x_t, \mathbf{x}^{(t-1)}, \mathbf{y}^{(t-1)}), \quad t = 1, 2, \ldots, \tag{9.8.1}$$

and that $X_{t+1}$ given $(X_t, \mathbf{X}^{(t-1)}, \mathbf{Y}^{(t)})$ is independent of $(\mathbf{X}^{(t-1)}, \mathbf{Y}^{(t)})$ with conditional density function

$$p(x_{t+1} | x_t) := p(x_{t+1} | x_t, \mathbf{x}^{(t-1)}, \mathbf{y}^{(t)}), \quad t = 1, 2, \ldots. \tag{9.8.2}$$

We shall also assume that the initial state $X_1$ has probability density $p_1$. The joint density of the observation and state variables can be computed directly from (9.8.1)-(9.8.2) as

$$\begin{aligned} p(y_1, \ldots, y_n, x_1, \ldots, x_n) &= p(y_n | x_n, \mathbf{x}^{(n-1)}, \mathbf{y}^{(n-1)}) \, p(x_n, \mathbf{x}^{(n-1)}, \mathbf{y}^{(n-1)}) \\ &= p(y_n | x_n) \, p(x_n | \mathbf{x}^{(n-1)}, \mathbf{y}^{(n-1)}) \, p(\mathbf{y}^{(n-1)}, \mathbf{x}^{(n-1)}) \\ &= p(y_n | x_n) \, p(x_n | x_{n-1}) \, p(\mathbf{y}^{(n-1)}, \mathbf{x}^{(n-1)}) \\ &= \cdots \\ &= \left( \prod_{j=1}^{n} p(y_j | x_j) \right) \left( \prod_{j=2}^{n} p(x_j | x_{j-1}) \right) p_1(x_1), \end{aligned}$$

and since (9.8.2) implies that $\{X_t\}$ is Markov (see Problem 9.22),

$$p(y_1, \ldots, y_n | x_1, \ldots, x_n) = \prod_{j=1}^{n} p(y_j | x_j). \tag{9.8.3}$$

We conclude that $Y_1, \ldots, Y_n$ are conditionally independent given the state variables $X_1, \ldots, X_n$, so that the dependence structure of $\{Y_t\}$ is inherited from that of the state process $\{X_t\}$. The sequence of state variables $\{X_t\}$ is often referred to as the hidden or latent generating process associated with the observed process.

In order to solve the filtering and prediction problems in this setting, we shall determine the conditional densities $p(x_t | \mathbf{y}^{(t)})$ of $X_t$ given $\mathbf{Y}^{(t)}$, and $p(x_t | \mathbf{y}^{(t-1)})$ of $X_t$ given $\mathbf{Y}^{(t-1)}$, respectively. The minimum mean squared error estimates of $X_t$ based on $\mathbf{Y}^{(t)}$ and $\mathbf{Y}^{(t-1)}$ can then be computed as the conditional expectations, $E(X_t | \mathbf{Y}^{(t)})$ and $E(X_t | \mathbf{Y}^{(t-1)})$.

An application of Bayes's theorem, using the assumption that the distribution of $Y_t$ given $(X_t, \mathbf{X}^{(t-1)}, \mathbf{Y}^{(t-1)})$ does not depend on $(\mathbf{X}^{(t-1)}, \mathbf{Y}^{(t-1)})$, yields

$$p(x_t | \mathbf{y}^{(t)}) = p(y_t | x_t) \, p(x_t | \mathbf{y}^{(t-1)}) / p(y_t | \mathbf{y}^{(t-1)}) \tag{9.8.4}$$

and

$$p(x_{t+1} | \mathbf{y}^{(t)}) = \int p(x_t | \mathbf{y}^{(t)}) \, p(x_{t+1} | x_t) \, d\mu(x_t). \tag{9.8.5}$$

(The integral relative to $d\mu(x_t)$ in (9.8.5) is interpreted as the integral relative to $dx_t$ in the continuous case and as the sum over all values of $x_t$ in the discrete case.) The initial condition needed to solve these recursions is

$$p(x_1 | \mathbf{y}^{(0)}) := p_1(x_1). \tag{9.8.6}$$

The factor $p(y_t | \mathbf{y}^{(t-1)})$ appearing in the denominator of (9.8.4) is just a scale factor, determined by the condition $\int p(x_t | \mathbf{y}^{(t)}) \, d\mu(x_t) = 1$. In the generalized state-space setup, prediction of a future state variable is less important than forecasting a future value of the observations. The relevant forecast density can be computed from (9.8.5) as

$$p(y_{t+1} | \mathbf{y}^{(t)}) = \int p(y_{t+1} | x_{t+1}) \, p(x_{t+1} | \mathbf{y}^{(t)}) \, d\mu(x_{t+1}). \tag{9.8.7}$$

Equations (9.8.1)-(9.8.2) can be regarded as a Bayesian model specification. A classical Bayesian model has two key assumptions. The first is that the data $Y_1, \ldots, Y_t$, given an unobservable parameter ($\mathbf{X}^{(t)}$ in our case), are independent with specified conditional distribution. This corresponds to (9.8.3). The second specifies a prior distribution for the parameter value. This corresponds to (9.8.2). The posterior distribution is then the conditional distribution of the parameter given the data. In the present setting the posterior distribution of the component $X_t$ of $\mathbf{X}^{(t)}$ is determined by the solution (9.8.4) of the filtering problem.

Example 9.8.1.

Consider the simplified version of the linear state-space model of Section 9.1,

$$Y_t = G X_t + W_t, \quad \{W_t\} \sim \mathrm{iid} \ \mathrm{N}(0, R), \tag{9.8.8}$$

$$X_{t+1} = F X_t + V_t, \quad \{V_t\} \sim \mathrm{iid} \ \mathrm{N}(0, Q), \tag{9.8.9}$$

where the noise sequences $\{W_t\}$ and $\{V_t\}$ are independent of each other. For this model the probability densities in (9.8.1)-(9.8.2) become

$$p_1(x_1) = n(x_1; EX_1, \mathrm{Var}(X_1)), \tag{9.8.10}$$

$$p(y_t | x_t) = n(y_t; G x_t, R), \tag{9.8.11}$$

$$p(x_{t+1} | x_t) = n(x_{t+1}; F x_t, Q), \tag{9.8.12}$$

where $n(x; \mu, \sigma^2)$ is the normal density with mean $\mu$ and variance $\sigma^2$ defined in Example (a) of Section A.1.

To solve the filtering and prediction problems in this new framework, we first observe that the filtering and prediction densities in (9.8.4) and (9.8.5) are both normal. We shall write them, using the notation of Section 9.4, as

$$p(x_t | \mathbf{Y}^{(t)}) = n(x_t; X_{t|t}, \Omega_{t|t}) \tag{9.8.13}$$

and

$$p(x_{t+1} | \mathbf{Y}^{(t)}) = n(x_{t+1}; \hat{X}_{t+1}, \Omega_{t+1}). \tag{9.8.14}$$

From (9.8.5), (9.8.12), (9.8.13), and (9.8.14), we find that

$$\begin{aligned} \hat{X}_{t+1} &= \int_{-\infty}^{\infty} x_{t+1} \, p(x_{t+1} | \mathbf{Y}^{(t)}) \, dx_{t+1} \\ &= \int_{-\infty}^{\infty} x_{t+1} \int_{-\infty}^{\infty} p(x_t | \mathbf{Y}^{(t)}) \, p(x_{t+1} | x_t) \, dx_t \, dx_{t+1} \\ &= \int_{-\infty}^{\infty} p(x_t | \mathbf{Y}^{(t)}) \left[ \int_{-\infty}^{\infty} x_{t+1} \, p(x_{t+1} | x_t) \, dx_{t+1} \right] dx_t \\ &= \int_{-\infty}^{\infty} F x_t \, p(x_t | \mathbf{Y}^{(t)}) \, dx_t \\ &= F X_{t|t} \end{aligned}$$

and (see Problem 9.23)

$$\Omega_{t+1} = F^2 \Omega_{t|t} + Q.$$

Substituting the corresponding densities (9.8.11) and (9.8.14) into (9.8.4), we find by equating the coefficient of $x_t^2$ on both sides of (9.8.4) that

$$\Omega_{t|t}^{-1} = G^2 R^{-1} + \Omega_t^{-1} = G^2 R^{-1} + (F^2 \Omega_{t-1|t-1} + Q)^{-1}$$

and

$$X_{t|t} = \hat{X}_t + \Omega_{t|t} G R^{-1} (Y_t - G \hat{X}_t).$$

Also, from (9.8.4) with $p(x_1 | \mathbf{y}^{(0)}) = n(x_1; EX_1, \Omega_1)$ we obtain the initial conditions

$$X_{1|1} = EX_1 + \Omega_{1|1} G R^{-1} (Y_1 - G EX_1)$$

and

$$\Omega_{1|1}^{-1} = G^2 R^{-1} + \Omega_1^{-1}.$$

The Kalman prediction and filtering recursions of Section 9.4 give the same results for $\hat{X}_t$ and $X_{t|t}$, since for Gaussian systems best linear mean square estimation is equivalent to best mean square estimation.

□
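The equivalence claimed at the end of Example 9.8.1 can be illustrated numerically for the scalar model (9.8.8)-(9.8.9). The sketch below codes the recursions of the example (precision form) next to the familiar gain form of the scalar Kalman filter, with gain $\Omega_t G / (G^2 \Omega_t + R)$; the function names and argument order are assumptions of this sketch.

```python
def filter_precision_form(ys, F, G, Q, R, m1, v1):
    """Filtered means and variances from the recursions of the example.

    m1 and v1 are EX_1 and Var(X_1), i.e. X-hat_1 and Omega_1.
    """
    x_pred, om_pred = m1, v1
    out = []
    for y in ys:
        om_filt = 1.0 / (G * G / R + 1.0 / om_pred)
        x_filt = x_pred + om_filt * G / R * (y - G * x_pred)
        out.append((x_filt, om_filt))
        x_pred = F * x_filt               # X-hat_{t+1} = F X_{t|t}
        om_pred = F * F * om_filt + Q     # Omega_{t+1} = F^2 Omega_{t|t} + Q
    return out


def filter_gain_form(ys, F, G, Q, R, m1, v1):
    """The same quantities computed with the usual scalar Kalman gain."""
    x_pred, om_pred = m1, v1
    out = []
    for y in ys:
        gain = om_pred * G / (G * G * om_pred + R)
        x_filt = x_pred + gain * (y - G * x_pred)
        om_filt = om_pred - gain * G * om_pred
        out.append((x_filt, om_filt))
        x_pred = F * x_filt
        om_pred = F * F * om_filt + Q
    return out
```

The two functions should return the same sequence of pairs $(X_{t|t}, \Omega_{t|t})$ up to rounding, since $\Omega_t - \Omega_t^2 G^2 / (G^2 \Omega_t + R) = (G^2 R^{-1} + \Omega_t^{-1})^{-1}$.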


Example  9.8.2.  A  non-Gaussian  Example 

In general, the solution of the recursions (9.8.4) and (9.8.5) presents substantial computational problems. Numerical methods for dealing with non-Gaussian models are discussed by Sorenson and Alspach (1971) and Kitagawa (1987). Here we shall illustrate the recursions (9.8.4) and (9.8.5) in a very simple special case. Consider the state equation

$$X_t = a X_{t-1}, \tag{9.8.15}$$

with observation density

$$p(y_t | x_t) = \frac{(\pi x_t)^{y_t} e^{-\pi x_t}}{y_t!}, \quad y_t = 0, 1, \ldots, \tag{9.8.16}$$

where $\pi$ is a constant between 0 and 1. The relationship in (9.8.15) implies that the transition density [in the discrete sense; see the comment after (9.8.5)] for the state variables is

$$p(x_{t+1} | x_t) = \begin{cases} 1, & \text{if } x_{t+1} = a x_t, \\ 0, & \text{otherwise.} \end{cases}$$

We shall assume that $X_1$ has the gamma density function

$$p_1(x_1) = g(x_1; \alpha, \lambda) = \frac{\lambda^\alpha x_1^{\alpha - 1} e^{-\lambda x_1}}{\Gamma(\alpha)}, \quad x_1 > 0.$$

(This is a simplified model for the evolution of the number $X_t$ of individuals at time $t$ infected with a rare disease, in which $X_t$ is treated as a continuous rather than an integer-valued random variable. The observation $Y_t$ represents the number of infected individuals observed in a random sample consisting of a small fraction $\pi$ of the population at time $t$.) Because the transition distribution of $\{X_t\}$ is not continuous, we use the integrated version of (9.8.5) to compute the prediction density. Thus,

$$\begin{aligned} P(X_t \le x | \mathbf{y}^{(t-1)}) &= \int_0^\infty P(X_t \le x | x_{t-1}) \, p(x_{t-1} | \mathbf{y}^{(t-1)}) \, dx_{t-1} \\ &= \int_0^{x/a} p(x_{t-1} | \mathbf{y}^{(t-1)}) \, dx_{t-1}. \end{aligned}$$

Differentiation with respect to $x$ gives

$$p(x_t | \mathbf{y}^{(t-1)}) = a^{-1} p_{X_{t-1} | \mathbf{Y}^{(t-1)}} (a^{-1} x_t | \mathbf{y}^{(t-1)}). \tag{9.8.17}$$

Now applying (9.8.4), we find that

$$\begin{aligned} p(x_1 | y_1) &= p(y_1 | x_1) \, p_1(x_1) / p(y_1) \\ &= \left( \frac{(\pi x_1)^{y_1} e^{-\pi x_1}}{y_1!} \right) \left( \frac{\lambda^\alpha x_1^{\alpha - 1} e^{-\lambda x_1}}{\Gamma(\alpha)} \right) \left( \frac{1}{p(y_1)} \right) \\ &= c(y_1) \, x_1^{\alpha + y_1 - 1} e^{-(\pi + \lambda) x_1}, \quad x_1 > 0, \end{aligned}$$

where $c(y_1)$ is an integration factor ensuring that $p(\cdot | y_1)$ integrates to 1. Since $p(\cdot | y_1)$ has the form of a gamma density, we deduce (see Example (d) of Section A.1) that

$$p(x_1 | y_1) = g(x_1; \alpha_1, \lambda_1), \tag{9.8.18}$$

where $\alpha_1 = \alpha + y_1$ and $\lambda_1 = \lambda + \pi$. The prediction density, calculated from (9.8.5) and (9.8.18), is

$$\begin{aligned} p(x_2 | \mathbf{y}^{(1)}) &= a^{-1} p_{X_1 | \mathbf{Y}^{(1)}} (a^{-1} x_2 | \mathbf{y}^{(1)}) \\ &= a^{-1} g(a^{-1} x_2; \alpha_1, \lambda_1) \\ &= g(x_2; \alpha_1, \lambda_1 / a). \end{aligned}$$

Iterating the recursions (9.8.4) and (9.8.5) and using (9.8.17), we find that for $t \ge 1$,

$$p(x_t | \mathbf{y}^{(t)}) = g(x_t; \alpha_t, \lambda_t) \tag{9.8.19}$$

and

$$p(x_{t+1} | \mathbf{y}^{(t)}) = a^{-1} g(a^{-1} x_{t+1}; \alpha_t, \lambda_t) = g(x_{t+1}; \alpha_t, \lambda_t / a), \tag{9.8.20}$$

where $\alpha_t = \alpha_{t-1} + y_t = \alpha + y_1 + \cdots + y_t$ and $\lambda_t = \lambda_{t-1} / a + \pi = \lambda a^{1-t} + \pi (1 - a^{-t}) / (1 - a^{-1})$. In particular, the minimum mean squared error estimate of $x_t$ based on $\mathbf{y}^{(t)}$ is the conditional expectation $\alpha_t / \lambda_t$ with conditional variance $\alpha_t / \lambda_t^2$. From (9.8.7) the probability density of $Y_{t+1}$ given $\mathbf{Y}^{(t)}$ is

$$\begin{aligned} p(y_{t+1} | \mathbf{y}^{(t)}) &= \int_0^\infty \left( \frac{(\pi x_{t+1})^{y_{t+1}} e^{-\pi x_{t+1}}}{y_{t+1}!} \right) g(x_{t+1}; \alpha_t, \lambda_t / a) \, dx_{t+1} \\ &= \frac{\Gamma(\alpha_t + y_{t+1})}{\Gamma(\alpha_t) \Gamma(y_{t+1} + 1)} \left( 1 - \frac{\pi}{\lambda_{t+1}} \right)^{\alpha_t} \left( \frac{\pi}{\lambda_{t+1}} \right)^{y_{t+1}} \\ &= nb(y_{t+1}; \alpha_t, 1 - \pi / \lambda_{t+1}), \quad y_{t+1} = 0, 1, \ldots, \end{aligned}$$

where $nb(y; \alpha, p)$ is the negative binomial density defined in example (i) of Section A.1. Conditional on $\mathbf{Y}^{(t)}$, the best one-step predictor of $Y_{t+1}$ is therefore the mean, $\alpha_t \pi / (\lambda_{t+1} - \pi)$, of this negative binomial distribution. The conditional mean squared error of the predictor is $\mathrm{Var}(Y_{t+1} | \mathbf{Y}^{(t)}) = \alpha_t \pi \lambda_{t+1} / (\lambda_{t+1} - \pi)^2$ (see Problem 9.25).

□
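The gamma-Poisson recursions of Example 9.8.2 are simple enough to code directly. The sketch below, with hypothetical function names, updates $(\alpha_t, \lambda_t)$ from a sequence of counts and evaluates the negative binomial forecast density in the parameterization used above, $nb(y; \alpha, p) = \Gamma(\alpha + y) / (\Gamma(\alpha) \Gamma(y + 1)) \, p^\alpha (1 - p)^y$. It assumes $a \neq 1$ only where the closed form for $\lambda_t$ is used for comparison.

```python
import math


def gamma_poisson_filter(ys, a, pi, alpha, lam):
    """Return [(alpha_t, lambda_t)] of the filtering densities g(x_t; ., .)."""
    params = []
    alpha_t = alpha
    pred_rate = lam           # rate of the prior density of X_1
    for y in ys:
        alpha_t = alpha_t + y
        lam_t = pred_rate + pi
        params.append((alpha_t, lam_t))
        pred_rate = lam_t / a  # rate of the prediction density of X_{t+1}
    return params


def nb_pmf(y, alpha, p):
    """Negative binomial density nb(y; alpha, p), computed on the log scale."""
    log_pmf = (math.lgamma(alpha + y) - math.lgamma(alpha)
               - math.lgamma(y + 1)
               + alpha * math.log(p) + y * math.log(1.0 - p))
    return math.exp(log_pmf)


def one_step_forecast(alpha_t, lam_t, a, pi):
    """Return (p, mean, variance) of the forecast density of Y_{t+1}."""
    lam_next = lam_t / a + pi
    p = 1.0 - pi / lam_next
    mean = alpha_t * pi / (lam_next - pi)
    var = alpha_t * pi * lam_next / (lam_next - pi) ** 2
    return p, mean, var
```

Summing `y * nb_pmf(y, alpha_t, p)` over a long enough range of `y` should reproduce the forecast mean returned by `one_step_forecast`, which gives a quick consistency check on the formulas.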


Example  9.8.3.  A  Model  for  Time  Series  of  Counts 


We often encounter time series in which the observations represent count data. One such example is the monthly number of newly recorded cases of poliomyelitis in the U.S. for the years 1970-1983 plotted in Figure 9-6. Unless the actual counts are large and can be approximated by continuous variables, Gaussian and linear time series models are generally inappropriate for analyzing such data. The parameter-driven specification provides a flexible class of models for modeling count data. We now discuss a specific model based on a Poisson observation density. This model is similar to the one presented by Zeger (1988) for analyzing the polio data. The observation density is assumed to be Poisson with mean $\exp\{x_t\}$, i.e.,

$$p(y_t | x_t) = \frac{e^{x_t y_t} e^{-e^{x_t}}}{y_t!}, \quad y_t = 0, 1, \ldots, \tag{9.8.21}$$

while the state variables are assumed to follow a regression model with Gaussian AR(1) noise. If $\mathbf{u}_t = (u_{t1}, \ldots, u_{tk})'$ are the regression variables, then

$$X_t = \beta' \mathbf{u}_t + W_t, \tag{9.8.22}$$

where $\beta$ is a $k$-dimensional regression parameter and

$$W_t = \phi W_{t-1} + Z_t, \quad \{Z_t\} \sim \mathrm{IID} \ \mathrm{N}(0, \sigma^2).$$

The transition density function for the state variables is then

$$p(x_{t+1} | x_t) = n(x_{t+1}; \beta' \mathbf{u}_{t+1} + \phi (x_t - \beta' \mathbf{u}_t), \sigma^2). \tag{9.8.23}$$

The case $\sigma^2 = 0$ corresponds to a log-linear model with Poisson noise.

[Figure 9-6: Monthly number of U.S. cases of polio, January 1970-December 1983]

Estimation of the parameters $\theta = (\beta', \phi, \sigma^2)'$ in the model by direct numerical maximization of the likelihood function is difficult, since the likelihood cannot be written down in closed form. (From (9.8.3) the likelihood is the $n$-fold integral,

$$\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \exp\left\{ \sum_{t=1}^{n} \left( x_t y_t - e^{x_t} \right) \right\} L(\theta; \mathbf{x}^{(n)}) \, (dx_1 \cdots dx_n) \Big/ \prod_{i=1}^{n} (y_i!),$$

where $L(\theta; \mathbf{x})$ is the likelihood based on $X_1, \ldots, X_n$.) To overcome this difficulty, Chan and Ledolter (1995) proposed an algorithm, called Monte Carlo EM (MCEM), whose iterates $\theta^{(i)}$ converge to the maximum likelihood estimate. To apply this algorithm, first note that the conditional distribution of $\mathbf{Y}^{(n)}$ given $\mathbf{X}^{(n)}$ does not depend on $\theta$, so that the likelihood based on the complete data $(\mathbf{X}^{(n)\prime}, \mathbf{Y}^{(n)\prime})'$ is given by

$$L(\theta; \mathbf{X}^{(n)}, \mathbf{Y}^{(n)}) = f(\mathbf{Y}^{(n)} | \mathbf{X}^{(n)}) \, L(\theta; \mathbf{X}^{(n)}).$$

The E-step of the algorithm (see Section 9.7) requires calculation of

$$\begin{aligned} Q(\theta | \theta^{(i)}) &= E_{\theta^{(i)}} \left( \ln L(\theta; \mathbf{X}^{(n)}, \mathbf{Y}^{(n)}) \,|\, \mathbf{Y}^{(n)} \right) \\ &= E_{\theta^{(i)}} \left( \ln f(\mathbf{Y}^{(n)} | \mathbf{X}^{(n)}) \,|\, \mathbf{Y}^{(n)} \right) + E_{\theta^{(i)}} \left( \ln L(\theta; \mathbf{X}^{(n)}) \,|\, \mathbf{Y}^{(n)} \right). \end{aligned}$$

We delete the first term from the definition of $Q$, since it is independent of $\theta$ and hence plays no role in the M-step of the EM algorithm. The new $Q$ is redefined as

$$Q(\theta | \theta^{(i)}) = E_{\theta^{(i)}} \left( \ln L(\theta; \mathbf{X}^{(n)}) \,|\, \mathbf{Y}^{(n)} \right). \tag{9.8.24}$$

Even with this simplification, direct calculation of $Q$ is still intractable. Suppose for the moment that it is possible to generate replicates of $\mathbf{X}^{(n)}$ from the conditional distribution of $\mathbf{X}^{(n)}$ given $\mathbf{Y}^{(n)}$ when $\theta = \theta^{(i)}$. If we denote $m$ independent replicates of $\mathbf{X}^{(n)}$ by $\mathbf{X}_1^{(n)}, \ldots, \mathbf{X}_m^{(n)}$, then a Monte Carlo approximation to $Q$ in (9.8.24) is given by

$$Q_m(\theta | \theta^{(i)}) = \frac{1}{m} \sum_{j=1}^{m} \ln L(\theta; \mathbf{X}_j^{(n)}).$$

The M-step is easy to carry out using $Q_m$ in place of $Q$ (especially if we condition on $X_1 = 0$ in all the simulated replicates), since $L$ is just the Gaussian likelihood of the regression model with AR(1) noise treated in Section 6.6. The difficult steps in the algorithm are the generation of replicates of $\mathbf{X}^{(n)}$ given $\mathbf{Y}^{(n)}$ and the choice of $m$. Chan and Ledolter (1995) discuss the use of the Gibbs sampler for generating the desired replicates and give some guidelines on the choice of $m$.

In their analyses of the polio data, Zeger (1988) and Chan and Ledolter (1995) included as regression components an intercept, a slope, and harmonics at periods of 6 and 12 months. Specifically, they took

$$\mathbf{u}_t = (1, t/1000, \cos(2\pi t/12), \sin(2\pi t/12), \cos(2\pi t/6), \sin(2\pi t/6))'.$$

The implementation of Chan and Ledolter's MCEM method by Kuk and Cheng (1994) gave estimates $\hat{\beta} = (0.247, -3.871, 0.162, -0.482, 0.414, -0.011)'$, $\hat{\phi} = 0.648$, and $\hat{\sigma}^2 = 0.281$. The estimated trend function $\hat{\beta}' \mathbf{u}_t$ is displayed in Figure 9-7. The negative coefficient of $t/1000$ indicates a slight downward trend in the monthly number of polio cases.

[Figure 9-7: Trend estimate for the monthly number of U.S. cases of polio, January 1970-December 1983]

□
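To get a feel for the parameter-driven count model (9.8.21)-(9.8.23), it can help to simulate from it. The sketch below does so with the standard library only. Several choices here are assumptions rather than part of the text: the time index runs over $t = 1, \ldots, n$, the AR(1) noise is started from its stationary distribution, Poisson variates are drawn by Knuth's multiplication method (adequate for small means), and all function names are hypothetical. It is a simulation aid, not the MCEM estimation procedure.

```python
import math
import random


def regressors(t):
    """u_t = (1, t/1000, cos(2 pi t/12), sin(2 pi t/12), cos(2 pi t/6), sin(2 pi t/6))'."""
    w = 2.0 * math.pi * t
    return [1.0, t / 1000.0,
            math.cos(w / 12.0), math.sin(w / 12.0),
            math.cos(w / 6.0), math.sin(w / 6.0)]


def poisson_draw(mean, rng):
    """Knuth's multiplication method for a Poisson variate."""
    limit = math.exp(-mean)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate_counts(n, beta, phi, sigma2, seed=0):
    """Simulate Y_1, ..., Y_n with Y_t | X_t ~ Poisson(exp(X_t))."""
    rng = random.Random(seed)
    sd = math.sqrt(sigma2)
    # Assumption: start the AR(1) noise W_t in its stationary distribution.
    w = rng.gauss(0.0, sd / math.sqrt(1.0 - phi ** 2))
    ys = []
    for t in range(1, n + 1):
        if t > 1:
            w = phi * w + rng.gauss(0.0, sd)
        x = sum(b * u for b, u in zip(beta, regressors(t))) + w
        ys.append(poisson_draw(math.exp(x), rng))
    return ys
```

For instance, `simulate_counts(168, [0.247, -3.871, 0.162, -0.482, 0.414, -0.011], 0.648, 0.281)` generates one artificial series of 168 monthly counts using the estimates quoted above.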


9.8.2  Observation-Driven  Models 

Again we assume that $Y_t$, conditional on $(X_t, \mathbf{X}^{(t-1)}, \mathbf{Y}^{(t-1)})$, is independent of $(\mathbf{X}^{(t-1)}, \mathbf{Y}^{(t-1)})$. These models are specified by the conditional densities

$$p(y_t \mid x_t) = p(y_t \mid \mathbf{x}^{(t)}, \mathbf{y}^{(t-1)}), \quad t = 1, 2, \ldots, \qquad (9.8.25)$$

$$p(x_{t+1} \mid \mathbf{y}^{(t)}) = p_{X_{t+1} \mid \mathbf{Y}^{(t)}}(x_{t+1} \mid \mathbf{y}^{(t)}), \quad t = 0, 1, \ldots, \qquad (9.8.26)$$

where $p(x_1 \mid \mathbf{y}^{(0)}) := p_1(x_1)$ for some prespecified initial density $p_1(x_1)$. The advantage of the observation-driven state equation (9.8.26) is that the posterior distribution of $X_t$ given $\mathbf{Y}^{(t)}$ can be computed directly from (9.8.4) without the use of the updating formula (9.8.5). This then allows for easy computation of the forecast function in (9.8.7) and hence of the joint density function of $(Y_1, \ldots, Y_n)'$,

$$p(y_1, \ldots, y_n) = \prod_{t=1}^{n} p(y_t \mid \mathbf{y}^{(t-1)}). \qquad (9.8.27)$$

On the other hand, the mechanism by which the state $X_{t-1}$ makes the transition to $X_t$ is not explicitly defined. In fact, without further assumptions there may be state sequences $\{X_t\}$ and $\{X_t^*\}$ with different distributions for which both (9.8.25) and (9.8.26) hold (see Example 9.8.5). Both sequences, however, lead to the same joint distribution, given by (9.8.27), for $Y_1, \ldots, Y_n$. The ambiguity in the specification of the distribution of the state variables can be removed by assuming that $X_{t+1}$ given $(\mathbf{X}^{(t)}, \mathbf{Y}^{(t)})$ is independent of $\mathbf{X}^{(t)}$, with conditional distribution (9.8.26), i.e.,

$$p(x_{t+1} \mid \mathbf{x}^{(t)}, \mathbf{y}^{(t)}) = p_{X_{t+1} \mid \mathbf{Y}^{(t)}}(x_{t+1} \mid \mathbf{y}^{(t)}). \qquad (9.8.28)$$
With this modification, the joint density of $\mathbf{Y}^{(n)}$ and $\mathbf{X}^{(n)}$ is given by (cf. (9.8.3))

$$p(\mathbf{y}^{(n)}, \mathbf{x}^{(n)}) = p(y_n \mid x_n)\, p(x_n \mid \mathbf{y}^{(n-1)})\, p(\mathbf{y}^{(n-1)}, \mathbf{x}^{(n-1)}) = \prod_{t=1}^{n} \bigl( p(y_t \mid x_t)\, p(x_t \mid \mathbf{y}^{(t-1)}) \bigr).$$

Example 9.8.4. An AR(1) Process

An AR(1) process with iid noise can be expressed as an observation-driven model. Suppose $\{Y_t\}$ is the AR(1) process

$$Y_t = \phi Y_{t-1} + Z_t,$$

where $\{Z_t\}$ is an iid sequence of random variables with mean 0 and some probability density function $f(x)$. Then with $X_t := Y_{t-1}$ we have

$$p(y_t \mid x_t) = f(y_t - \phi x_t)$$

and

$$p(x_{t+1} \mid \mathbf{y}^{(t)}) = \begin{cases} 1, & \text{if } x_{t+1} = y_t, \\ 0, & \text{otherwise.} \end{cases}$$

□

Example 9.8.5.

Suppose the observation-equation density is given by

$$p(y_t \mid x_t) = \frac{x_t^{y_t} e^{-x_t}}{y_t!}, \quad y_t = 0, 1, \ldots, \qquad (9.8.29)$$

and the state equation (9.8.26) is

$$p(x_{t+1} \mid \mathbf{y}^{(t)}) = g(x_t; \alpha_t, \lambda_t), \qquad (9.8.30)$$

where $\alpha_t = \alpha + y_1 + \cdots + y_t$ and $\lambda_t = \lambda + t$. It is possible to give a parameter-driven specification that gives rise to the same state equation (9.8.30). Let $\{X_t^*\}$ be the parameter-driven state variables, where $X_t^* = X_{t-1}^*$ and $X_1^*$ has a gamma distribution with parameters $\alpha$ and $\lambda$. (This corresponds to the model in Example 9.8.2 with $\pi = a = 1$.) Then from (9.8.19) we see that $p(x_t^* \mid \mathbf{y}^{(t)}) = g(x_t^*; \alpha_t, \lambda_t)$, which coincides with the state equation (9.8.30). If $\{X_t\}$ are the state variables whose joint distribution is specified through (9.8.28), then $\{X_t\}$ and $\{X_t^*\}$ cannot have the same joint distributions. To see this, note that

$$p(x_{t+1}^* \mid \mathbf{x}^{*(t)}, \mathbf{y}^{(t)}) = \begin{cases} 1, & \text{if } x_{t+1}^* = x_t^*, \\ 0, & \text{otherwise,} \end{cases}$$

while

$$p(x_{t+1} \mid \mathbf{x}^{(t)}, \mathbf{y}^{(t)}) = p(x_{t+1} \mid \mathbf{y}^{(t)}) = g(x_t; \alpha_t, \lambda_t).$$

If the two sequences had the same joint distribution, then the latter density could take only the values 0 and 1, which contradicts the continuity (as a function of $x_t$) of this density.


□ 
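The factorization (9.8.27) can be illustrated with the AR(1) process of Example 9.8.4, where the state equation is degenerate and $p(y_t \mid \mathbf{y}^{(t-1)}) = f(y_t - \phi y_{t-1})$. The sketch below is only an illustration. It assumes that the noise density $f$ is Gaussian with standard deviation `sigma` and that the initial state $X_1 = Y_0$ is known (`y0`); neither choice is part of the text.

```python
import math


def normal_logpdf(z, sigma):
    """Log density of N(0, sigma^2) at z (assumed form of the noise density f)."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - 0.5 * (z / sigma) ** 2


def ar1_loglik(y, phi, sigma, y0=0.0):
    """Sum over t of ln p(y_t | y^(t-1)) = ln f(y_t - phi * x_t), as in (9.8.27)."""
    loglik = 0.0
    x = y0  # state X_1 := Y_0, treated as known (assumption)
    for yt in y:
        loglik += normal_logpdf(yt - phi * x, sigma)
        x = yt  # degenerate state equation: X_{t+1} = y_t with probability 1
    return loglik
```

For example, `ar1_loglik([1.0, 0.5], phi=0.5, sigma=1.0)` adds the log densities of the two residuals $1.0$ and $0.0$.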
9.8.3 Exponential Family Models

The exponential family of distributions provides a large and flexible class of distributions for use in the observation equation. The density in the observation equation is said to belong to an exponential family (in natural parameterization) if

$$p(y_t \mid x_t) = \exp\{ y_t x_t - b(x_t) + c(y_t) \}, \qquad (9.8.31)$$

where $b(\cdot)$ is a twice continuously differentiable function and $c(y_t)$ does not depend on $x_t$. This family includes the normal, exponential, gamma, Poisson, binomial, and many other distributions frequently encountered in statistics. Detailed properties of the exponential family can be found in Barndorff-Nielsen (1978), and an excellent treatment of its use in the analysis of linear models is given by McCullagh and Nelder (1989). We shall need only the following important facts:

$$e^{b(x_t)} = \int \exp\{ y_t x_t + c(y_t) \}\, \nu(dy_t), \qquad (9.8.32)$$

$$b'(x_t) = E(Y_t \mid x_t), \qquad (9.8.33)$$

$$b''(x_t) = \mathrm{Var}(Y_t \mid x_t) := \int y_t^2\, p(y_t \mid x_t)\, \nu(dy_t) - [b'(x_t)]^2, \qquad (9.8.34)$$

where integration with respect to $\nu(dy_t)$ means integration with respect to $dy_t$ in the continuous case and summation over all values of $y_t$ in the discrete case.

Proof. The first relation is simply the statement that $p(y_t \mid x_t)$ integrates to 1. The second relation is established by differentiating both sides of (9.8.32) with respect to $x_t$ and then multiplying through by $e^{-b(x_t)}$ (for justification of the differentiation under the integral sign see Barndorff-Nielsen 1978). The last relation is obtained by differentiating (9.8.32) twice with respect to $x_t$ and simplifying. ■
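The relations (9.8.32)-(9.8.34) can be checked numerically in the Poisson case treated in the next example, where $b(x) = e^{x}$ and $c(y) = -\ln y!$. The sketch below is illustrative only: the truncation point `y_max` and the finite-difference step `h` are arbitrary choices, not part of the text.

```python
import math


def b(x):
    """Cumulant function of the Poisson family in natural parameterization."""
    return math.exp(x)


def pmf(y, x):
    """p(y | x) = exp{y*x - b(x) + c(y)} with c(y) = -ln y!."""
    return math.exp(y * x - b(x) - math.lgamma(y + 1))


def moments(x, y_max=200):
    """Total mass, mean and variance of p(. | x), summing over y = 0..y_max."""
    probs = [pmf(y, x) for y in range(y_max + 1)]
    total = sum(probs)
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum(y * y * p for y, p in enumerate(probs)) - mean ** 2
    return total, mean, var


def derivatives(x, h=1e-4):
    """Central finite-difference approximations to b'(x) and b''(x)."""
    d1 = (b(x + h) - b(x - h)) / (2.0 * h)
    d2 = (b(x + h) - 2.0 * b(x) + b(x - h)) / h ** 2
    return d1, d2
```

Comparing `moments(x)` with `derivatives(x)` at any moderate `x` shows the total mass close to 1, the mean close to $b'(x)$, and the variance close to $b''(x)$, up to truncation and finite-difference error.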

Example  9.8.6.  The  Poisson  Case 

If the observation $Y_t$, given $X_t = x_t$, has a Poisson distribution of the form (9.8.21), then

$$p(y_t \mid x_t) = \exp\{ y_t x_t - e^{x_t} - \ln y_t! \}, \quad y_t = 0, 1, \ldots, \qquad (9.8.35)$$

which has the form (9.8.31) with $b(x_t) = e^{x_t}$ and $c(y_t) = -\ln y_t!$. From (9.8.33) we easily find that $E(Y_t \mid x_t) = b'(x_t) = e^{x_t}$. This parameterization is slightly different from the one used in Examples 9.8.2 and 9.8.5, where the conditional mean of $Y_t$ given $x_t$ was $\pi x_t$ and not $e^{x_t}$. For this observation equation, define the family of densities

$$f(x; \alpha, \lambda) = \exp\{ \alpha x - \lambda b(x) + A(\alpha, \lambda) \}, \quad -\infty < x < \infty, \qquad (9.8.36)$$

where $\alpha > 0$ and $\lambda > 0$ are parameters and $A(\alpha, \lambda) = -\ln \Gamma(\alpha) + \alpha \ln \lambda$. Now consider state densities of the form

$$p(x_{t+1} \mid \mathbf{y}^{(t)}) = f(x_{t+1}; \alpha_{t+1|t}, \lambda_{t+1|t}), \qquad (9.8.37)$$

where $\alpha_{t+1|t}$ and $\lambda_{t+1|t}$ are, for the moment, unspecified functions of $\mathbf{y}^{(t)}$. (The subscript $t+1|t$ on the parameters is a shorthand way to indicate dependence on the conditional distribution of $X_{t+1}$ given $\mathbf{Y}^{(t)}$.) With this specification of the state densities, the parameters $\alpha_{t+1|t}$ and $\lambda_{t+1|t}$ are related to the best one-step predictor of $Y_{t+1}$ through the formula

$$\alpha_{t+1|t} / \lambda_{t+1|t} = \hat{Y}_{t+1} := E(Y_{t+1} \mid \mathbf{y}^{(t)}). \qquad (9.8.38)$$
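Proof. We have from (9.8.7) and (9.8.33) that

$$E(Y_{t+1} \mid \mathbf{y}^{(t)}) = \sum_{y_{t+1}=0}^{\infty} \int_{-\infty}^{\infty} y_{t+1}\, p(y_{t+1} \mid x_{t+1})\, p(x_{t+1} \mid \mathbf{y}^{(t)})\, dx_{t+1} = \int_{-\infty}^{\infty} b'(x_{t+1})\, p(x_{t+1} \mid \mathbf{y}^{(t)})\, dx_{t+1}.$$

Addition and subtraction of $\alpha_{t+1|t} / \lambda_{t+1|t}$ then gives

$$E(Y_{t+1} \mid \mathbf{y}^{(t)}) = \int_{-\infty}^{\infty} \left( b'(x_{t+1}) - \frac{\alpha_{t+1|t}}{\lambda_{t+1|t}} \right) p(x_{t+1} \mid \mathbf{y}^{(t)})\, dx_{t+1} + \frac{\alpha_{t+1|t}}{\lambda_{t+1|t}}$$

$$= \int_{-\infty}^{\infty} -\lambda_{t+1|t}^{-1}\, p'(x_{t+1} \mid \mathbf{y}^{(t)})\, dx_{t+1} + \frac{\alpha_{t+1|t}}{\lambda_{t+1|t}}$$

$$= \left[ -\lambda_{t+1|t}^{-1}\, p(x_{t+1} \mid \mathbf{y}^{(t)}) \right]_{x_{t+1}=-\infty}^{x_{t+1}=\infty} + \frac{\alpha_{t+1|t}}{\lambda_{t+1|t}}$$

$$= \frac{\alpha_{t+1|t}}{\lambda_{t+1|t}}.$$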

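Letting $A_{t|t-1} = A(\alpha_{t|t-1}, \lambda_{t|t-1})$, we can write the posterior density of $X_t$ given $\mathbf{Y}^{(t)}$ as

$$p(x_t \mid \mathbf{y}^{(t)}) = \exp\{ y_t x_t - b(x_t) + c(y_t) \} \exp\{ \alpha_{t|t-1} x_t - \lambda_{t|t-1} b(x_t) + A_{t|t-1} \} / p(y_t \mid \mathbf{y}^{(t-1)})$$

$$= \exp\{ \alpha_t x_t - \lambda_t b(x_t) + A(\alpha_t, \lambda_t) \}$$

$$= f(x_t; \alpha_t, \lambda_t),$$

where we find, by equating coefficients of $x_t$ and $b(x_t)$, that the coefficients $\lambda_t$ and $\alpha_t$ are determined by

$$\lambda_t = 1 + \lambda_{t|t-1}, \qquad (9.8.39)$$

$$\alpha_t = y_t + \alpha_{t|t-1}. \qquad (9.8.40)$$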

The  family  of  prior  densities  in  (9.8.37)  is  called  a  conjugate  family  of  priors  for 
the  observation  equation  (9.8.35),  since  the  resulting  posterior  densities  are  again 
members  of  the  same  family. 

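As mentioned earlier, the parameters $\alpha_{t|t-1}$ and $\lambda_{t|t-1}$ can be quite arbitrary: Any nonnegative functions of $\mathbf{y}^{(t-1)}$ will lead to a consistent specification of the state densities. One convenient choice is to link these parameters with the corresponding parameters of the posterior distribution at time $t-1$ through the relations

$$\lambda_{t+1|t} = \delta \lambda_t \ \bigl(= \delta(1 + \lambda_{t|t-1})\bigr), \qquad (9.8.41)$$

$$\alpha_{t+1|t} = \delta \alpha_t \ \bigl(= \delta(y_t + \alpha_{t|t-1})\bigr), \qquad (9.8.42)$$

where $0 < \delta < 1$ (see Remark 4 below). Iterating the relation (9.8.41), we see that

$$\lambda_{t+1|t} = \delta(1 + \lambda_{t|t-1}) = \delta + \delta \lambda_{t|t-1}$$

$$= \delta + \delta(\delta + \delta \lambda_{t-1|t-2})$$

$$= \cdots = \delta + \delta^2 + \cdots + \delta^t + \delta^t \lambda_{1|0} \qquad (9.8.43)$$

$$\to \frac{\delta}{1 - \delta}$$

as $t \to \infty$. Similarly,

$$\alpha_{t+1|t} = \delta y_t + \delta \alpha_{t|t-1}$$

$$= \delta y_t + \delta^2 y_{t-1} + \cdots + \delta^t y_1 + \delta^t \alpha_{1|0}. \qquad (9.8.44)$$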

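For large $t$, we have the approximations

$$\lambda_{t+1|t} \approx \delta / (1 - \delta) \qquad (9.8.45)$$

and

$$\alpha_{t+1|t} \approx \delta \sum_{j=0}^{t-1} \delta^j y_{t-j}, \qquad (9.8.46)$$

which are exact if $\lambda_{1|0} = \delta/(1 - \delta)$ and $\alpha_{1|0} = 0$. From (9.8.38) the one-step predictors are linear and given by

$$\hat{Y}_{t+1} = \frac{\alpha_{t+1|t}}{\lambda_{t+1|t}} = \frac{\sum_{j=0}^{t-1} \delta^j y_{t-j} + \delta^{t-1} \alpha_{1|0}}{\sum_{j=0}^{t-1} \delta^j + \delta^{t-1} \lambda_{1|0}}. \qquad (9.8.47)$$

Replacing the denominator with its limiting value, or starting with $\lambda_{1|0} = \delta/(1 - \delta)$, we find that $\hat{Y}_{t+1}$ is the solution of the recursions

$$\hat{Y}_{t+1} = (1 - \delta) y_t + \delta \hat{Y}_t, \quad t = 1, 2, \ldots, \qquad (9.8.48)$$

with initial condition $\hat{Y}_1 = (1 - \delta)\delta^{-1} \alpha_{1|0}$. In other words, under the restrictions of (9.8.41) and (9.8.42), the best one-step predictors can be found by exponential smoothing.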

□ 
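The equivalence between the parameter recursions (9.8.39)-(9.8.42) and the exponential smoothing recursion (9.8.48) can be checked numerically. The following is a minimal sketch with illustrative function names; any data series passed to it is the user's own.

```python
def power_steady_predictors(y, delta, alpha0=0.0, lam0=None):
    """One-step predictors alpha_{t+1|t} / lambda_{t+1|t} for t = 1, ..., n."""
    if lam0 is None:
        lam0 = delta / (1.0 - delta)  # steady-state starting value
    alpha, lam = alpha0, lam0  # alpha_{1|0}, lambda_{1|0}
    preds = []
    for yt in y:
        alpha = delta * (yt + alpha)  # (9.8.40) followed by (9.8.42)
        lam = delta * (1.0 + lam)     # (9.8.39) followed by (9.8.41)
        preds.append(alpha / lam)
    return preds


def exponential_smoothing(y, delta, yhat1):
    """Predictors from (9.8.48), started at yhat1 = (1 - delta) * alpha_{1|0} / delta."""
    preds = []
    yhat = yhat1
    for yt in y:
        yhat = (1.0 - delta) * yt + delta * yhat
        preds.append(yhat)
    return preds
```

When `lam0` is left at its default $\delta/(1-\delta)$, the two functions return the same sequence; with any other `lam0` they differ for small $t$ and agree in the limit, as (9.8.43) suggests.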


Remark 1. The preceding analysis for the Poisson-distributed observation equation holds, almost verbatim, for the general family of exponential densities (9.8.31). (One only needs to take care in specifying the correct range for $x$ and the allowable parameter space for $\alpha$ and $\lambda$ in (9.8.37).) The relations (9.8.43)-(9.8.44), as well as the exponential smoothing formula (9.8.48), continue to hold even in the more general setting, provided that the parameters $\alpha_{t|t-1}$ and $\lambda_{t|t-1}$ satisfy the relations (9.8.41)-(9.8.42). □

Remark 2. Equations (9.8.41)-(9.8.42) are equivalent to the assumption that the prior density of $X_t$ given $\mathbf{y}^{(t-1)}$ is proportional to the $\delta$-power of the posterior distribution of $X_{t-1}$ given $\mathbf{Y}^{(t-1)}$, or more succinctly that

$$f(x_t; \alpha_{t|t-1}, \lambda_{t|t-1}) = f(x_t; \delta\alpha_{t-1|t-1}, \delta\lambda_{t-1|t-1}) \propto f^{\delta}(x_t; \alpha_{t-1|t-1}, \lambda_{t-1|t-1}).$$

This power relationship is sometimes referred to as the power steady model (Grunwald et al. 1993; Smith 1979). □

Remark 3. The transformed state variables $W_t = e^{X_t}$ have a gamma state density given by

$$p(w_{t+1} \mid \mathbf{y}^{(t)}) = g(w_{t+1}; \alpha_{t+1|t}, \lambda_{t+1|t})$$

(see Problem 9.26). The mean and variance of this conditional density are

$$E(W_{t+1} \mid \mathbf{y}^{(t)}) = \alpha_{t+1|t} / \lambda_{t+1|t} \quad \text{and} \quad \mathrm{Var}(W_{t+1} \mid \mathbf{y}^{(t)}) = \alpha_{t+1|t} / \lambda_{t+1|t}^2. \quad □$$


Remark 4. If we regard the random walk plus noise model of Example 9.2.1 as the prototypical state-space model, then from the calculations in Example 9.8.1 with $G = F = 1$, we have

$$E(X_{t+1} \mid \mathbf{Y}^{(t)}) = E(X_t \mid \mathbf{Y}^{(t)})$$

and

$$\mathrm{Var}(X_{t+1} \mid \mathbf{Y}^{(t)}) = \mathrm{Var}(X_t \mid \mathbf{Y}^{(t)}) + Q > \mathrm{Var}(X_t \mid \mathbf{Y}^{(t)}).$$

The first of these equations implies that the best estimate of the next state is the same as the best estimate of the current state, while the second implies that the variance increases. Under the conditions (9.8.41) and (9.8.42), the same is also true for the state variables in the above model (see Problem 9.26). This was, in part, the rationale behind these conditions given in Harvey and Fernandes (1989). □

Remark 5. While the calculations work out neatly for the power steady model, Grunwald et al. (1994) have shown that such processes have degenerate sample paths for large $t$. In the Poisson example above, they argue that the observations $Y_t$ converge to 0 as $t \to \infty$ (see Figure 9-12). Although such models may still be useful in practice for modeling series of moderate length, the efficacy of using such models for describing long-term behavior is doubtful. □

Example  9.8.7.  Goals  Scored  by  England  Against  Scotland 

The time series of the number of goals scored by England against Scotland in soccer matches played at Hampden Park in Glasgow is graphed in Figure 9-8. The matches have been played nearly every second year, with interruptions during the war years. We will treat the data $y_1, \ldots, y_{52}$ as coming from an equally spaced time series model $\{Y_t\}$. Since the number of goals scored is small (see the frequency histogram in Figure 9-9), a model based on the Poisson distribution might be deemed appropriate. The observed relative frequencies and those based on a Poisson distribution with mean equal to $\bar{y}_{52} = 1.269$ are contained in Table 9.2. The standard chi-squared goodness of fit test, comparing the observed frequencies with expected frequencies based on a Poisson model, has a p-value of 0.02. The lack of fit with a Poisson distribution is hardly unexpected, since the sample variance (1.652) is much larger than the sample mean, while the mean and variance of the Poisson distribution are equal. In this case the data are said to be overdispersed in the sense that there is more variability in the data than one would expect from a sample of independent Poisson-distributed variables. Overdispersion can sometimes be explained by serial dependence in the data.

Dependence in count data can often be revealed by estimating the probabilities of transition from one state to another. Table 9.3 contains estimates of these probabilities, computed as the average number of one-step transitions from state $y_t$ to state $y_{t+1}$. If the data were independent, then in each column the entries should be nearly the same. This is certainly not the case in Table 9.3. For example, England is very unlikely to be shut out or score 3 or more goals in the next match after scoring at least three goals in the previous encounter.
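Transition probabilities of this kind can be estimated by counting one-step transitions. The sketch below is an illustration rather than the procedure used to build Table 9.3: the function name and the pooling threshold `cap` are assumptions, and the goal data themselves are not reproduced here.

```python
from collections import Counter


def transition_probabilities(y, cap=3):
    """Estimate p(y_{t+1} | y_t), pooling all counts >= cap into one state.

    Returns a dict mapping each observed state i to the list
    [p(0 | i), p(1 | i), ..., p(cap | i)].
    """
    states = [min(v, cap) for v in y]
    pair_counts = Counter(zip(states[:-1], states[1:]))
    row_totals = Counter(states[:-1])
    table = {}
    for i in range(cap + 1):
        if row_totals[i] == 0:
            continue  # state i never occurs as a starting state
        table[i] = [pair_counts[(i, j)] / row_totals[i] for j in range(cap + 1)]
    return table
```

Each row of the returned table sums to 1. For independent data the rows would be approximately equal, which is the column-wise comparison described above.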
Figure 9-8. Goals scored by England against Scotland at Hampden Park, Glasgow, 1872-1987.

Figure 9-9. Histogram of the data in Figure 9-8.


Table 9.2  Relative frequency and fitted Poisson distribution of goals scored by England against Scotland

Number of goals          0       1       2       3       4       5
Relative frequency     0.288   0.423   0.154   0.019   0.096   0.019
Poisson distribution   0.281   0.356   0.226   0.096   0.030   0.008
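The fitted Poisson row of Table 9.2 can be recomputed from the quoted sample mean. This is a minimal sketch; the variable names are illustrative, and the only inputs are the numbers quoted in the text and table.

```python
import math

relative_freq = [0.288, 0.423, 0.154, 0.019, 0.096, 0.019]  # Table 9.2
table_poisson = [0.281, 0.356, 0.226, 0.096, 0.030, 0.008]  # Table 9.2
sample_mean = 1.269       # quoted in the text
sample_variance = 1.652   # quoted in the text


def poisson_pmf(k, mu):
    """Poisson probability of the value k for mean mu."""
    return math.exp(-mu) * mu ** k / math.factorial(k)


fitted = [poisson_pmf(k, sample_mean) for k in range(6)]

# Ratio of sample variance to sample mean; a value above 1 indicates overdispersion.
dispersion_index = sample_variance / sample_mean
```

The recomputed probabilities agree with the tabulated values to within about 0.001, and the dispersion index exceeds 1, consistent with the overdispersion noted in the text.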

Table 9.3  Transition probabilities for the number of goals scored by England against Scotland

p(y_{t+1} | y_t)     y_{t+1} = 0      1        2       ≥ 3
y_t = 0                 0.214       0.500    0.214    0.072
y_t = 1                 0.409       0.272    0.136    0.182
y_t = 2                 0.250       0.375    0.125    0.250
y_t ≥ 3                 0           0.857    0.143    0

Harvey and Fernandes (1989) model the dependence in this data using an observation-driven model of the type described in Example 9.8.6. Their model assumes a Poisson observation equation and a log-gamma state equation:

$$p(y_t \mid x_t) = \frac{\exp\{ y_t x_t - e^{x_t} \}}{y_t!}, \quad y_t = 0, 1, \ldots,$$

$$p(x_t \mid \mathbf{y}^{(t-1)}) = f(x_t; \alpha_{t|t-1}, \lambda_{t|t-1}), \quad -\infty < x < \infty,$$

for $t = 1, 2, \ldots$, where $f$ is given by (9.8.36) and $\alpha_{1|0} = 0$, $\lambda_{1|0} = 0$. The power steady conditions (9.8.41)-(9.8.42) are assumed to hold for $\alpha_{t|t-1}$ and $\lambda_{t|t-1}$. The only unknown parameter in the model is $\delta$. The log-likelihood function for $\delta$ based on the conditional distribution of $y_2, \ldots, y_{52}$ given $y_1$ is given by [see (9.8.27)]

$$\ell(\delta; \mathbf{y}^{(n)}) = \sum_{t=1}^{n-1} \ln p(y_{t+1} \mid \mathbf{y}^{(t)}), \qquad (9.8.49)$$

where $p(y_{t+1} \mid \mathbf{y}^{(t)})$ is the negative binomial density [see Problem 9.25(c)]

$$p(y_{t+1} \mid \mathbf{y}^{(t)}) = nb\bigl(y_{t+1}; \alpha_{t+1|t}, (1 + \lambda_{t+1|t})^{-1}\bigr),$$

with $\alpha_{t+1|t}$ and $\lambda_{t+1|t}$ as defined in (9.8.44) and (9.8.43). (For the goal data, $y_1 = 0$, which implies $\alpha_{2|1} = 0$ and hence that $p(y_2 \mid \mathbf{y}^{(1)})$ is a degenerate density with unit mass at $y_2 = 0$. Harvey and Fernandes avoid this complication by conditioning the likelihood on $\mathbf{y}^{(\tau)}$, where $\tau$ is the time of the first nonzero data value.)

Table 9.4  Prediction density of $Y_{53}$ given $\mathbf{Y}^{(52)}$ for data in Figure 9-8

Number of goals               0       1       2       3       4       5
p(y_53 | y^(52))            0.472   0.326   0.138   0.046   0.013   0.004

Maximizing this likelihood with respect to $\delta$, we obtain $\hat{\delta} = 0.844$. (Starting equations (9.8.43)-(9.8.44) with $\alpha_{1|0} = 0$ and $\lambda_{1|0} = \delta/(1 - \delta)$, we obtain $\hat{\delta} = 0.732$.) With 0.844 as our estimate of $\delta$, the prediction density of the next observation $Y_{53}$ given $\mathbf{y}^{(52)}$ is $nb\bigl(y_{53}; \alpha_{53|52}, (1 + \lambda_{53|52})^{-1}\bigr)$. The first six values of this distribution are given in Table 9.4. Under this model, the probability that England will be held scoreless in the next match is 0.471. The one-step predictors, $\hat{Y}_1 = 0, \hat{Y}_2, \ldots, \hat{Y}_{52}$, are graphed in Figure 9-10. (This graph can be obtained by using the ITSM option Smooth>Exponential with $\alpha = 0.154$.)

Figures  9-11  and  9-12  contain  two  realizations  from  the  fitted  model  for  the  goal 
data.  The  general  appearance  of  the  first  realization  is  somewhat  compatible  with  the 
goal  data,  while  the  second  realization  illustrates  the  convergence  of  the  sample  path 
to  0  in  accordance  with  the  result  of  Grunwald  et  al.  (1994). 

□ 
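The likelihood calculation of Example 9.8.7 can be sketched as follows. This is an illustration under stated assumptions, not Harvey and Fernandes's code: the negative binomial density is written as the gamma mixture of Poisson distributions (the form implied by Remark 3), conditioning starts after the first nonzero observation as described in the example, and a coarse grid search stands in for a proper optimizer. The goal data are not reproduced here, so no attempt is made to recover the quoted estimate.

```python
import math


def nb_logpmf(y, alpha, lam):
    """ln nb(y; alpha, (1 + lam)^(-1)), i.e. the log of
    Gamma(alpha + y) / (Gamma(alpha) y!) * (lam/(1+lam))**alpha * (1/(1+lam))**y."""
    return (math.lgamma(alpha + y) - math.lgamma(alpha) - math.lgamma(y + 1)
            + alpha * math.log(lam / (1.0 + lam)) - y * math.log(1.0 + lam))


def power_steady_loglik(y, delta):
    """Log-likelihood (9.8.49), conditioned on the data up to the first nonzero value."""
    alpha, lam = 0.0, 0.0  # alpha_{1|0} = lambda_{1|0} = 0
    loglik = 0.0
    started = False
    for yt in y:
        if started:
            # alpha, lam currently hold alpha_{t|t-1}, lambda_{t|t-1}
            loglik += nb_logpmf(yt, alpha, lam)
        alpha = delta * (alpha + yt)  # (9.8.40) then (9.8.42)
        lam = delta * (lam + 1.0)     # (9.8.39) then (9.8.41)
        if yt > 0:
            started = True
    return loglik


def grid_search_delta(y, steps=99):
    """Return the grid value of delta in (0, 1) with the largest log-likelihood."""
    grid = [k / (steps + 1.0) for k in range(1, steps + 1)]
    return max(grid, key=lambda d: power_steady_loglik(y, d))
```

Calling `grid_search_delta` on a count series returns an approximate maximizer of (9.8.49); a finer grid or a one-dimensional optimizer would refine it.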
Figure 9-11. A simulated time series from the fitted model to the goal data.

Figure 9-12. A second simulated time series from the fitted model to the goal data.

Example 9.8.8. The Exponential Case

Suppose $Y_t$ given $X_t$ has an exponential density with mean $-1/X_t$ ($X_t < 0$). The observation density is given by

$$p(y_t \mid x_t) = \exp\{ y_t x_t + \ln(-x_t) \}, \quad y_t > 0,$$

which has the form (9.8.31) with $b(x) = -\ln(-x)$ and $c(y) = 0$. The state densities corresponding to the family of conjugate priors (see (9.8.37)) are given by

$$p(x_{t+1} \mid \mathbf{y}^{(t)}) = \exp\{ \alpha_{t+1|t} x_{t+1} - \lambda_{t+1|t} b(x_{t+1}) + A_{t+1|t} \}, \quad -\infty < x < 0.$$

(Here $p(x_{t+1} \mid \mathbf{y}^{(t)})$ is a probability density when $\alpha_{t+1|t} > 0$ and $\lambda_{t+1|t} > -1$.) The one-step prediction density is

$$p(y_{t+1} \mid \mathbf{y}^{(t)}) = \int_{-\infty}^{0} e^{x y_{t+1} + \ln(-x) + \alpha_{t+1|t} x - \lambda_{t+1|t} b(x) + A_{t+1|t}}\, dx$$

$$= (\lambda_{t+1|t} + 1)\, \alpha_{t+1|t}^{\lambda_{t+1|t} + 1}\, (y_{t+1} + \alpha_{t+1|t})^{-\lambda_{t+1|t} - 2}, \quad y_{t+1} > 0$$

(see Problem 9.28). While $E(Y_{t+1} \mid \mathbf{y}^{(t)}) = \alpha_{t+1|t} / \lambda_{t+1|t}$, the conditional variance is finite if and only if $\lambda_{t+1|t} > 1$. Under assumptions (9.8.41)-(9.8.42), and starting with $\lambda_{1|0} = \delta/(1 - \delta)$, the exponential smoothing formula (9.8.48) remains valid.

□ 


Problems 


9.1 Show that if all the eigenvalues of $F$ are less than 1 in absolute value (or equivalently that $F^k \to 0$ as $k \to \infty$), the unique stationary solution of equation (9.1.11) is given by the infinite series

$$\mathbf{X}_t = \sum_{j=0}^{\infty} F^j \mathbf{V}_{t-j-1}$$

and that the corresponding observation vectors are

$$\mathbf{Y}_t = \mathbf{W}_t + \sum_{j=0}^{\infty} G F^j \mathbf{V}_{t-j-1}.$$

Deduce that $\{(\mathbf{X}_t', \mathbf{Y}_t')'\}$ is a multivariate stationary process. (Hint: Use a vector analogue of the argument in Example 2.2.1.)

9.2 In Example 9.2.1, show that $\theta = -1$ if and only if $\sigma_v^2 = 0$, which in turn is equivalent to the signal $M_t$ being constant.

9.3 Let $F$ be the coefficient of $\mathbf{X}_t$ in the state equation (9.3.4) for the causal AR($p$) process

$$X_t - \phi_1 X_{t-1} - \cdots - \phi_p X_{t-p} = Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2).$$

Establish the stability of (9.3.4) by showing that

$$\det(zI - F) = z^p \phi(z^{-1}),$$

and hence that the eigenvalues of $F$ are the reciprocals of the zeros of the autoregressive polynomial $\phi(z) = 1 - \phi_1 z - \cdots - \phi_p z^p$.

9.4 By following the argument in Example 9.3.3, find a state-space model for $\{Y_t\}$ when $\{\nabla \nabla_{12} Y_t\}$ is an ARMA($p$, $q$) process.

9.5 For the local linear trend model defined by equations (9.2.6)-(9.2.7), show that $\nabla^2 Y_t = (1 - B)^2 Y_t$ is a 2-correlated sequence and hence, by Proposition 2.1.1, is an MA(2) process. Show that this MA(2) process is noninvertible if $\sigma_u^2 = 0$.

9.6 a. For the seasonal model of Example 9.2.2, show that $\nabla_d Y_t = Y_t - Y_{t-d}$ is an MA(1) process.

b. Show that $\nabla \nabla_d Y_t$ is an MA($d + 1$) process where $\{Y_t\}$ follows the seasonal model with a local linear trend as described in Example 9.2.3.


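9.7 Let $\{Y_t\}$ be the MA(1) process

$$Y_t = Z_t + \theta Z_{t-1}, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2).$$

Show that $\{Y_t\}$ has the state-space representation

$$Y_t = [1 \ \ 0]\, \mathbf{X}_t,$$

where $\{\mathbf{X}_t\}$ is the unique stationary solution of

$$\mathbf{X}_{t+1} = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix} \mathbf{X}_t + \begin{bmatrix} 1 \\ \theta \end{bmatrix} Z_{t+1}.$$

In particular, show that the state vector $\mathbf{X}_t$ can be written as

$$\mathbf{X}_t = \begin{bmatrix} 1 & \theta \\ \theta & 0 \end{bmatrix} \begin{bmatrix} Z_t \\ Z_{t-1} \end{bmatrix}.$$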

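9.8 Verify equations (9.3.16)-(9.3.18) for an ARIMA(1,1,1) process.

9.9 Consider the two state-space models

$$\mathbf{X}_{t+1,1} = F_1 \mathbf{X}_{t1} + \mathbf{V}_{t1}, \qquad \mathbf{Y}_{t1} = G_1 \mathbf{X}_{t1} + \mathbf{W}_{t1},$$

and

$$\mathbf{X}_{t+1,2} = F_2 \mathbf{X}_{t2} + \mathbf{V}_{t2}, \qquad \mathbf{Y}_{t2} = G_2 \mathbf{X}_{t2} + \mathbf{W}_{t2},$$

where $\{(\mathbf{V}_{t1}', \mathbf{W}_{t1}', \mathbf{V}_{t2}', \mathbf{W}_{t2}')'\}$ is white noise. Derive a state-space representation for $\{(\mathbf{Y}_{t1}', \mathbf{Y}_{t2}')'\}$.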

9.10  Use  Remark  1  of  Section  9.4  to  establish  the  linearity  properties  of  the  operator 
Pt  stated  in  Remark  3. 

9.11 a. Show that if the matrix equation $XS = B$ can be solved for $X$, then $X = BS^{-1}$ is a solution for any generalized inverse $S^{-1}$ of $S$.

b.  Use  the  result  of  (a)  to  derive  the  expression  for  P(X|Y)  in  Remark  4  of 
Section  9.4. 
9.12 In the notation of the Kalman prediction equations, show that every vector of the form

$$\mathbf{Y} = A_1 \mathbf{X}_1 + \cdots + A_t \mathbf{X}_t$$

can be expressed as

$$\mathbf{Y} = B_1 \mathbf{X}_1 + \cdots + B_{t-1} \mathbf{X}_{t-1} + C_t \mathbf{I}_t,$$

where $B_1, \ldots, B_{t-1}$ and $C_t$ are matrices that depend on the matrices $A_1, \ldots, A_t$. Show also that the converse is true. Use these results and the fact that $E(\mathbf{X}_s \mathbf{I}_t') = 0$ for all $s < t$ to establish (9.4.3).

9.13 In Example 9.4.1, verify that the steady-state solution of the Kalman recursions (9.1.2) is given by $\Omega = \bigl( \sigma_v^2 + \sqrt{\sigma_v^4 + 4\sigma_v^2 \sigma_w^2} \bigr) / 2$.

9.14 Show from the difference equations for $\Omega_t$ in Example 9.4.1 that $(\Omega_{t+1} - \Omega)(\Omega_t - \Omega) \ge 0$ for all $\Omega_t \ge 0$, where $\Omega$ is the steady-state solution for $\Omega_t$ given in Problem 9.13.

9.15 Show directly that for the MA(1) model (9.2.3), the parameter $\theta$ is equal to $-\bigl( 2\sigma_w^2 + \sigma_v^2 - \sqrt{\sigma_v^4 + 4\sigma_v^2 \sigma_w^2} \bigr) / (2\sigma_w^2)$, which in turn is equal to $-\sigma_w^2 / (\Omega + \sigma_w^2)$, where $\Omega$ is the steady-state solution for $\Omega_t$ given in Problem 9.13.

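9.16 Use the ARIMA(0,1,1) representation of the series $\{Y_t\}$ in Example 9.4.1 to show that the predictors defined by

$$\hat{Y}_{n+1} = a Y_n + (1 - a) \hat{Y}_n, \quad n = 1, 2, \ldots,$$

where $a = \Omega / (\Omega + \sigma_w^2)$, satisfy

$$Y_{n+1} - \hat{Y}_{n+1} = Z_{n+1} + (1 - a)^n \bigl( Y_0 - Z_0 - \hat{Y}_1 \bigr).$$

Deduce that if $0 < a < 1$, the mean squared error of $\hat{Y}_{n+1}$ converges to $\Omega + \sigma_w^2$ for any initial predictor $\hat{Y}_1$ with finite mean squared error.

9.17 a. Using equations (9.4.1) and (9.4.16), show that $\hat{\mathbf{X}}_{t+1} = F_t \mathbf{X}_{t|t}$.

b. From (a) and (9.4.16) show that $\mathbf{X}_{t|t}$ satisfies the recursions

$$\mathbf{X}_{t|t} = F_{t-1} \mathbf{X}_{t-1|t-1} + \Omega_t G_t' \Delta_t^{-1} \bigl( \mathbf{Y}_t - G_t F_{t-1} \mathbf{X}_{t-1|t-1} \bigr)$$

for $t = 2, 3, \ldots$, with $\mathbf{X}_{1|1} = \hat{\mathbf{X}}_1 + \Omega_1 G_1' \Delta_1^{-1} \bigl( \mathbf{Y}_1 - G_1 \hat{\mathbf{X}}_1 \bigr)$.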

9.18 In Section 9.5, show that for fixed $Q^*$, $-2 \ln L(\mu, Q^*, \sigma_w^2)$ is minimized when $\mu$ and $\sigma_w^2$ are given by (9.5.10) and (9.5.11), respectively.

9.19 Verify the calculation of $\Theta_t \Delta_t^{-1}$ and $\Omega_t$ in Example 9.6.1.

9.20  Verify  that  the  best  estimates  of  missing  values  in  an  AR (p)  process  are  found 
by  minimizing  (9.6.11)  with  respect  to  the  missing  values. 

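9.21 Suppose that $\{Y_t\}$ is the AR(2) process

$$Y_t = \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2),$$

and that we observe $Y_1, Y_2, Y_4, Y_5, Y_6, Y_7$. Show that the best estimator of $Y_3$ is

$$\bigl( \phi_2 (Y_1 + Y_5) + (\phi_1 - \phi_1 \phi_2)(Y_2 + Y_4) \bigr) / \bigl( 1 + \phi_1^2 + \phi_2^2 \bigr).$$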
9.22 Let $X_t$ be the state at time $t$ of a parameter-driven model (see (9.8.2)). Show that $\{X_t\}$ is a Markov chain and that (9.8.3) holds.

9.23 For the generalized state-space model of Example 9.8.1, show that $\Omega_{t+1} = F^2 \Omega_{t|t} + Q$.

9.24 If $Y$ and $X$ are random variables, show that

$$\mathrm{Var}(Y) = E(\mathrm{Var}(Y \mid X)) + \mathrm{Var}(E(Y \mid X)).$$

9.25 Suppose that $Y$ and $X$ are two random variables such that the distribution of $Y$ given $X$ is Poisson with mean $\pi X$, $0 < \pi \le 1$, and $X$ has the gamma density $g(x; \alpha, \lambda)$.

a. Show that the posterior distribution of $X$ given $Y$ also has a gamma density and determine its parameters.

b. Compute $E(X \mid Y)$ and $\mathrm{Var}(X \mid Y)$.

c. Show that $Y$ has a negative binomial density and determine its parameters.

d. Use (c) to compute $E(Y)$ and $\mathrm{Var}(Y)$.

e. Verify in Example 9.8.2 that $E(Y_{t+1} \mid \mathbf{Y}^{(t)}) = \alpha_t \pi / (\lambda_{t+1} - \pi)$ and $\mathrm{Var}(Y_{t+1} \mid \mathbf{Y}^{(t)}) = \alpha_t \pi \lambda_{t+1} / (\lambda_{t+1} - \pi)^2$.

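9.26 For the model of Example 9.8.6, show that

a. $E(X_{t+1} \mid \mathbf{Y}^{(t)}) = E(X_t \mid \mathbf{Y}^{(t)})$, $\mathrm{Var}(X_{t+1} \mid \mathbf{Y}^{(t)}) > \mathrm{Var}(X_t \mid \mathbf{Y}^{(t)})$, and

b. the transformed sequence $W_t = e^{X_t}$ has a gamma state density.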


9.27 Let $\{V_t\}$ be a sequence of independent exponential random variables with $EV_t = t^{-1}$ and suppose that $\{X_t, t \ge 1\}$ and $\{Y_t, t \ge 1\}$ are the state and observation random variables, respectively, of the parameter-driven state-space system

$$X_1 = V_1,$$

$$X_t = X_{t-1} + V_t, \quad t = 2, 3, \ldots,$$

where the distribution of the observation $Y_t$, conditional on the random variables $Y_1, Y_2, \ldots, Y_{t-1}, X_t$, is Poisson with mean $X_t$.

a. Determine the observation and state transition density functions $p(y_t \mid x_t)$ and $p(x_{t+1} \mid x_t)$ in the parameter-driven model for $\{Y_t\}$.

b. Show, using (9.8.4)-(9.8.6), that

$$p(x_1 \mid y_1) = g(x_1; y_1 + 1, 2)$$

and

$$p(x_2 \mid y_1) = g(x_2; y_1 + 2, 2),$$

where $g(x; \alpha, \lambda)$ is the gamma density function (see Example (d) of Section A.1).

c. Show that

$$p(x_t \mid \mathbf{y}^{(t)}) = g(x_t; \alpha_t + t, t + 1)$$

and

$$p(x_{t+1} \mid \mathbf{y}^{(t)}) = g(x_{t+1}; \alpha_t + t + 1, t + 1),$$

where $\alpha_t = y_1 + \cdots + y_t$.

d. Conclude from (c) that the minimum mean squared error estimates of $X_t$ and $X_{t+1}$ based on $Y_1, \ldots, Y_t$ are

$$\frac{t + Y_1 + \cdots + Y_t}{t + 1}$$

and

$$\frac{t + 1 + Y_1 + \cdots + Y_t}{t + 1},$$

respectively.


9.28 Let $Y$ and $X$ be two random variables such that $Y$ given $X$ is exponential with mean $1/X$, and $X$ has the gamma density function

$$g(x; \lambda + 1, \alpha) = \frac{\alpha^{\lambda+1} x^{\lambda} \exp\{-\alpha x\}}{\Gamma(\lambda + 1)}, \quad x > 0,$$

where $\lambda > -1$ and $\alpha > 0$.

a. Determine the posterior distribution of $X$ given $Y$.

b. Show that $Y$ has a Pareto distribution

$$p(y) = (\lambda + 1)\, \alpha^{\lambda+1} (y + \alpha)^{-\lambda-2}, \quad y > 0.$$

c. Find the mean and variance of $Y$. Under what conditions on $\alpha$ and $\lambda$ does the latter exist?

d. Verify the calculation of $p(y_{t+1} \mid \mathbf{y}^{(t)})$ and $E(Y_{t+1} \mid \mathbf{y}^{(t)})$ for the model in Example 9.8.8.


9.29 Consider an observation-driven model in which $Y_t$ given $X_t$ is binomial with parameters $n$ and $X_t$, i.e.,

$$p(y_t \mid x_t) = \binom{n}{y_t} x_t^{y_t} (1 - x_t)^{n - y_t}, \quad y_t = 0, 1, \ldots, n.$$

a. Show that the observation equation with state variable transformed by the logit transformation $W_t = \ln(X_t / (1 - X_t))$ follows an exponential family

$$p(y_t \mid w_t) = \exp\{ y_t w_t - b(w_t) + c(y_t) \}.$$

Determine the functions $b(\cdot)$ and $c(\cdot)$.

b. Suppose that the state $X_t$ has the beta density

$$p(x_{t+1} \mid \mathbf{y}^{(t)}) = f(x_{t+1}; \alpha_{t+1|t}, \lambda_{t+1|t}),$$

where

$$f(x; \alpha, \lambda) = [B(\alpha, \lambda)]^{-1} x^{\alpha - 1} (1 - x)^{\lambda - 1}, \quad 0 < x < 1,$$

$B(\alpha, \lambda) := \Gamma(\alpha)\Gamma(\lambda) / \Gamma(\alpha + \lambda)$ is the beta function, and $\alpha, \lambda > 0$. Show that the posterior distribution of $X_t$ given $\mathbf{Y}^{(t)}$ is also beta and express its parameters in terms of $y_t$ and $\alpha_{t|t-1}$, $\lambda_{t|t-1}$.

c. Under the assumptions made in (b), show that $E(X_t \mid \mathbf{Y}^{(t)}) = E(X_{t+1} \mid \mathbf{Y}^{(t)})$ and $\mathrm{Var}(X_t \mid \mathbf{Y}^{(t)}) < \mathrm{Var}(X_{t+1} \mid \mathbf{Y}^{(t)})$.

d. Assuming that the parameters in (b) satisfy (9.8.41)-(9.8.42), show that the one-step prediction density $p(y_{t+1} \mid \mathbf{y}^{(t)})$ is beta-binomial,

$$p(y_{t+1} \mid \mathbf{y}^{(t)}) = \frac{B(\alpha_{t+1|t} + y_{t+1},\ \lambda_{t+1|t} + n - y_{t+1})}{(n + 1)\, B(y_{t+1} + 1,\ n - y_{t+1} + 1)\, B(\alpha_{t+1|t}, \lambda_{t+1|t})},$$

and verify that $\hat{Y}_{t+1}$ is given by (9.8.47).


Forecasting  Techniques 


10.1 The ARAR Algorithm

10.2 The Holt-Winters Algorithm

10.3 The Holt-Winters Seasonal Algorithm

10.4 Choosing a Forecasting Algorithm


We  have  focused  until  now  on  the  construction  of  time  series  models  for  stationary 
and  nonstationary  series  and  the  determination,  assuming  the  appropriateness  of  these 
models,  of  minimum  mean  squared  error  predictors.  If  the  observed  series  had  in 
fact  been  generated  by  the  fitted  model,  this  procedure  would  give  minimum  mean 
squared  error  forecasts.  In  this  chapter  we  discuss  three  forecasting  techniques  that 
have  less  emphasis  on  the  explicit  construction  of  a  model  for  the  data.  Each  of  the 
three  selects,  from  a  limited  class  of  algorithms,  the  one  that  is  optimal  according  to 
specified  criteria. 

The  three  techniques  have  been  found  in  practice  to  be  effective  on  wide  ranges 
of  real  data  sets  (for  example,  the  economic  time  series  used  in  the  forecasting  com¬ 
petition  described  by  Makridakis  et  al.  1984). 

The  ARAR  algorithm  described  in  Section  10.1  is  an  adaptation  of  the  ARARMA 
algorithm  (Newton  and  Parzen  1984;  Parzen  1982)  in  which  the  idea  is  to  apply  auto¬ 
matically  selected  “memory- shortening”  transformations  (if  necessary)  to  the  data 
and  then  to  fit  an  ARMA  model  to  the  transformed  series.  The  ARAR  algorithm  we 
describe  is  a  version  of  this  in  which  the  ARMA  fitting  step  is  replaced  by  the  fitting 
of  a  subset  AR  model  to  the  transformed  data. 

The  Holt-Winters  (HW)  algorithm  described  in  Section  10.2  uses  a  set  of  simple 
recursions  that  generalize  the  exponential  smoothing  recursions  of  Section  1.5.1  to 
generate  forecasts  of  series  containing  a  locally  linear  trend. 
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The  Holt-Winters  seasonal  (HWS)  algorithm  extends  the  HW  algorithm  to  handle 
data  in  which  there  are  both  trend  and  seasonal  variation  of  known  period.  It  is 
described  in  Section  10.3. 

Each  of  these  three  algorithms  can  be  applied  to  specific  data  sets  with  the  aid  of 
the ITSM options Forecasting>ARAR, Forecasting>Holt-Winters, and
Forecasting>Seasonal Holt-Winters.


10.1 The ARAR Algorithm


10.1.1 Memory Shortening

Given a data set $\{Y_t, t = 1, 2, \ldots, n\}$, the first step is to decide whether the underlying
process is “long-memory,” and if so to apply a memory-shortening transformation before
attempting to fit an autoregressive model. The differencing operations permitted
under the option Transform of ITSM are examples of memory-shortening
transformations; however, the ones from which the option Forecasting>ARAR selects
are members of a more general class. There are two types allowed:

$$\tilde Y_t = Y_t - \hat\phi(\hat\tau)\,Y_{t-\hat\tau} \tag{10.1.1}$$

and

$$\tilde Y_t = Y_t - \hat\phi_1 Y_{t-1} - \hat\phi_2 Y_{t-2}. \tag{10.1.2}$$

With the aid of the five-step algorithm described below, we classify $\{Y_t\}$ and take
one of the following three courses of action:


• L. Declare $\{Y_t\}$ to be long-memory and form $\{\tilde Y_t\}$ using (10.1.1).

• M. Declare $\{Y_t\}$ to be moderately long-memory and form $\{\tilde Y_t\}$ using (10.1.2).

• S. Declare $\{Y_t\}$ to be short-memory.


If the alternative L or M is chosen, then the transformed series $\{\tilde Y_t\}$ is again
checked. If it is found to be long-memory or moderately long-memory, then a further
transformation is performed. The process continues until the transformed series is
classified as short-memory. At most three memory-shortening transformations are
performed, but it is very rare to require more than two. The algorithm for deciding
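among L, M, and S can be described as follows:

1. For each $\tau = 1, 2, \ldots, 15$, we find the value $\hat\phi(\tau)$ of $\phi$ that minimizes

$$\mathrm{ERR}(\phi, \tau) = \frac{\sum_{t=\tau+1}^{n} [Y_t - \phi Y_{t-\tau}]^2}{\sum_{t=\tau+1}^{n} Y_t^2}.$$

We then define

$$\mathrm{Err}(\tau) = \mathrm{ERR}(\hat\phi(\tau), \tau)$$

and choose the lag $\hat\tau$ to be the value of $\tau$ that minimizes $\mathrm{Err}(\tau)$.

2. If $\mathrm{Err}(\hat\tau) \le 8/n$, go to L.

3. If $\hat\phi(\hat\tau) \ge 0.93$ and $\hat\tau > 2$, go to L.

4. If $\hat\phi(\hat\tau) \ge 0.93$ and $\hat\tau = 1$ or 2, determine the values $\hat\phi_1$ and $\hat\phi_2$ of $\phi_1$ and $\phi_2$ that
minimize $\sum_{t=3}^{n} [Y_t - \phi_1 Y_{t-1} - \phi_2 Y_{t-2}]^2$; then go to M.

5. If $\hat\phi(\hat\tau) < 0.93$, go to S.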
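
The five steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ITSM implementation: the function name `classify_memory` and its return convention are hypothetical, NumPy is assumed to be available, and ties at the thresholds 8/n and 0.93 are treated inclusively.

```python
import numpy as np

def classify_memory(y, max_lag=15, threshold=0.93):
    """Apply steps 1-5 once; return ('L', tau, phi), ('M', phi1, phi2) or ('S',)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    best = None
    for tau in range(1, max_lag + 1):
        lagged, current = y[:-tau], y[tau:]
        # The least-squares phi for this lag minimizes the numerator of ERR(phi, tau).
        phi = np.dot(current, lagged) / np.dot(lagged, lagged)
        err = np.sum((current - phi * lagged) ** 2) / np.sum(current ** 2)
        if best is None or err < best[0]:
            best = (err, tau, phi)
    err, tau, phi = best
    if err <= 8.0 / n or (phi >= threshold and tau > 2):    # steps 2 and 3
        return ("L", tau, phi)
    if phi >= threshold:                                     # step 4: tau is 1 or 2
        design = np.column_stack([y[1:-1], y[:-2]])          # Y_{t-1}, Y_{t-2}
        phi1, phi2 = np.linalg.lstsq(design, y[2:], rcond=None)[0]
        return ("M", phi1, phi2)
    return ("S",)                                            # step 5
```

In a full implementation the function would be called again on the transformed series until it returns ('S',) or three transformations have been applied.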

10.1.2 Fitting a Subset Autoregression

Let $\{S_t, t = k + 1, \ldots, n\}$ denote the memory-shortened series derived from $\{Y_t\}$ by
the algorithm of the previous section and let $\bar S$ denote the sample mean of $S_{k+1}, \ldots, S_n$.

The next step in the modeling procedure is to fit an autoregressive process to the
mean-corrected series

$$X_t = S_t - \bar S, \quad t = k + 1, \ldots, n.$$

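The fitted model has the form

$$X_t = \phi_1 X_{t-1} + \phi_{l_1} X_{t-l_1} + \phi_{l_2} X_{t-l_2} + \phi_{l_3} X_{t-l_3} + Z_t,$$

where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$, and for given lags $l_1$, $l_2$, and $l_3$, the coefficients $\phi_j$ and the
white noise variance $\sigma^2$ are found from the Yule-Walker equations

$$\begin{bmatrix}
1 & \hat\rho(l_1 - 1) & \hat\rho(l_2 - 1) & \hat\rho(l_3 - 1) \\
\hat\rho(l_1 - 1) & 1 & \hat\rho(l_2 - l_1) & \hat\rho(l_3 - l_1) \\
\hat\rho(l_2 - 1) & \hat\rho(l_2 - l_1) & 1 & \hat\rho(l_3 - l_2) \\
\hat\rho(l_3 - 1) & \hat\rho(l_3 - l_1) & \hat\rho(l_3 - l_2) & 1
\end{bmatrix}
\begin{bmatrix} \phi_1 \\ \phi_{l_1} \\ \phi_{l_2} \\ \phi_{l_3} \end{bmatrix}
=
\begin{bmatrix} \hat\rho(1) \\ \hat\rho(l_1) \\ \hat\rho(l_2) \\ \hat\rho(l_3) \end{bmatrix}$$

and

$$\sigma^2 = \hat\gamma(0)\left[1 - \phi_1\hat\rho(1) - \phi_{l_1}\hat\rho(l_1) - \phi_{l_2}\hat\rho(l_2) - \phi_{l_3}\hat\rho(l_3)\right],$$

where $\hat\gamma(j)$ and $\hat\rho(j)$, $j = 0, 1, 2, \ldots$, are the sample autocovariances and
autocorrelations of the series $\{X_t\}$.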

The program computes the coefficients $\phi_j$ for each set of lags such that

$$1 < l_1 < l_2 < l_3 \le m,$$

where m can be chosen to be either 13 or 26. It then selects the model for which the
Yule-Walker estimate $\hat\sigma^2$ is minimal and prints out the lags, coefficients, and white
noise variance for the fitted model.
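
As a rough illustration of the search just described, the following Python sketch solves the four-lag Yule-Walker system for every admissible triple of lags and keeps the one with the smallest white noise variance. The function names are hypothetical, NumPy is assumed, and the sample autocovariances are taken with divisor n; it is a sketch of the minimum-variance criterion only, not of the ITSM code.

```python
import numpy as np
from itertools import combinations

def sample_acvf(x, max_lag):
    """Sample autocovariances gamma(0..max_lag) of the mean-corrected series."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    n = len(x)
    return np.array([np.dot(x[: n - h], x[h:]) / n for h in range(max_lag + 1)])

def fit_subset_ar(x, m=13):
    """Return (lags, coefficients, sigma2) minimizing the Yule-Walker variance."""
    gamma = sample_acvf(x, m)
    rho = gamma / gamma[0]
    best = None
    for l1, l2, l3 in combinations(range(2, m + 1), 3):   # 1 < l1 < l2 < l3 <= m
        lags = np.array([1, l1, l2, l3])
        R = rho[np.abs(lags[:, None] - lags[None, :])]     # matrix of rho(|l_i - l_j|)
        r = rho[lags]
        phi = np.linalg.solve(R, r)
        sigma2 = gamma[0] * (1.0 - phi @ r)
        if best is None or sigma2 < best[2]:
            best = (lags, phi, sigma2)
    return best
```

With m = 13 the loop visits 220 lag triples and with m = 26 it visits 2300, so exhaustive search is cheap.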

A slower procedure chooses the lags and coefficients (computed from the Yule-Walker
equations as above) that maximize the Gaussian likelihood of the observations.
For this option the maximum lag m is 13.

The options are displayed in the ARAR Forecasting dialog box, which
appears on the screen when the option Forecasting>ARAR is selected. It allows
you also to bypass memory shortening and fit a subset AR to the original
(mean-corrected) data.


10.1.3  Forecasting 

If the memory-shortening filter found in the first step has coefficients $\psi_0 (= 1)$,
$\psi_1, \ldots, \psi_k$ $(k \ge 0)$, then the memory-shortened series can be expressed as

$$S_t = \psi(B)Y_t = Y_t + \psi_1 Y_{t-1} + \cdots + \psi_k Y_{t-k}, \tag{10.1.3}$$

where $\psi(B)$ is the polynomial in the backward shift operator,

$$\psi(B) = 1 + \psi_1 B + \cdots + \psi_k B^k.$$

Similarly, if the coefficients of the subset autoregression found in the second step are
$\phi_1$, $\phi_{l_1}$, $\phi_{l_2}$, and $\phi_{l_3}$, then the subset AR model for the mean-corrected series
$\{X_t = S_t - \bar S\}$ is

$$\phi(B)X_t = Z_t, \tag{10.1.4}$$

where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$ and

$$\phi(B) = 1 - \phi_1 B - \phi_{l_1} B^{l_1} - \phi_{l_2} B^{l_2} - \phi_{l_3} B^{l_3}.$$

From (10.1.3) and (10.1.4) we obtain the equations

$$\xi(B)Y_t = \phi(1)\bar S + Z_t, \tag{10.1.5}$$

where

$$\xi(B) = \psi(B)\phi(B) = 1 + \xi_1 B + \cdots + \xi_{k+l_3} B^{k+l_3}.$$

Assuming that the fitted model (10.1.5) is appropriate and that the white noise
term $Z_t$ is uncorrelated with $\{Y_j, j < t\}$ for each t, we can determine the minimum
mean squared error linear predictors $P_nY_{n+h}$ of $Y_{n+h}$ in terms of $\{1, Y_1, \ldots, Y_n\}$, for
$n > k + l_3$, from the recursions

$$P_nY_{n+h} = -\sum_{j=1}^{k+l_3} \xi_j P_nY_{n+h-j} + \phi(1)\bar S, \quad h \ge 1, \tag{10.1.6}$$

with the initial conditions

$$P_nY_{n+h} = Y_{n+h} \quad \text{for } h \le 0. \tag{10.1.7}$$

The mean squared error of the predictor $P_nY_{n+h}$ is found to be (Problem 10.1)

$$E\left[(Y_{n+h} - P_nY_{n+h})^2\right] = \sum_{j=0}^{h-1} \tau_j^2 \sigma^2, \tag{10.1.8}$$

where $\sum_{j=0}^{\infty} \tau_j z^j$ is the Taylor expansion of $1/\xi(z)$ in a neighborhood of z = 0.
Equivalently the sequence $\{\tau_j\}$ can be found from the recursion

$$\tau_0 = 1, \quad \sum_{j=0}^{n} \tau_j \xi_{n-j} = 0, \quad n = 1, 2, \ldots. \tag{10.1.9}$$
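
The recursions (10.1.6)-(10.1.9) translate directly into code. The sketch below is illustrative only: the function name `arar_forecast` and its argument layout are assumptions, NumPy is assumed, and the caller is expected to supply the filter coefficients, the subset AR lags and coefficients, the mean $\bar S$, and $\sigma^2$ from the earlier steps. It also assumes the series is at least as long as the order of the whitening filter.

```python
import numpy as np

def arar_forecast(y, psi, lags, phi, sbar, sigma2, h):
    """Forecasts P_n Y_{n+1..n+h} and their mean squared errors.

    psi  : [1, psi_1, ..., psi_k], memory-shortening filter coefficients
    lags : subset AR lags, e.g. [1, l1, l2, l3]; phi: matching coefficients
    """
    y = np.asarray(y, dtype=float)
    ar_poly = np.zeros(max(lags) + 1)
    ar_poly[0] = 1.0
    for lag, coef in zip(lags, phi):
        ar_poly[lag] -= coef                  # phi(B) = 1 - sum phi_l B^l
    xi = np.convolve(psi, ar_poly)            # xi(B) = psi(B) phi(B)
    const = ar_poly.sum() * sbar              # phi(1) * S-bar
    ext = list(y)
    for _ in range(h):
        # (10.1.6); ext[-j] is the value j steps before the one being predicted,
        # and observed values play the role of P_n Y_{n+h} for h <= 0 (10.1.7).
        ext.append(const - sum(xi[j] * ext[-j] for j in range(1, len(xi))))
    tau = [1.0]
    for n in range(1, h):
        # (10.1.9): tau_n = -sum_{j<n} tau_j xi_{n-j}, with xi_i = 0 beyond its order
        tau.append(-sum(tau[j] * xi[n - j] for j in range(n) if n - j < len(xi)))
    mse = sigma2 * np.cumsum(np.square(tau))  # (10.1.8)
    return np.array(ext[len(y):]), mse
```

For example, with no memory shortening (psi = [1]) and a single AR coefficient 0.5 at lag 1, the forecasts from a last observation of 4 with $\bar S = 0$ are 2, 1, 0.5, and the mean squared errors are $\sigma^2$ times 1, 1.25, 1.3125.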


10.1.4 Application of the ARAR Algorithm

To determine an ARAR model for a given data set $\{Y_t\}$ using ITSM, select
Forecasting>ARAR and choose the appropriate options in the resulting dialog
box. These include specification of the number of forecasts required, whether or
not you wish to include the memory-shortening step, whether you require prediction
bounds, and which of the optimality criteria is to be used. Once you have made
these selections, click OK, and the forecasts will be plotted with the original data.
Right-click on the graph and then Info to see the coefficients $1, \psi_1, \ldots, \psi_k$ of the
memory-shortening filter $\psi(B)$, the lags and coefficients of the subset autoregression

$$X_t - \phi_1 X_{t-1} - \phi_{l_1} X_{t-l_1} - \phi_{l_2} X_{t-l_2} - \phi_{l_3} X_{t-l_3} = Z_t,$$

and the coefficients $\xi_j$ of $B^j$ in the overall whitening filter

$$\xi(B) = \left(1 + \psi_1 B + \cdots + \psi_k B^k\right)\left(1 - \phi_1 B - \phi_{l_1} B^{l_1} - \phi_{l_2} B^{l_2} - \phi_{l_3} B^{l_3}\right).$$

The numerical values of the predictors, their root mean squared errors, and the
prediction bounds are also printed.

Figure 10-1 The data set DEATHS.TSM with 24 values predicted by the ARAR algorithm

Example 10.1.1 To use the ARAR algorithm to predict 24 values of the accidental deaths
data, open the file DEATHS.TSM and proceed as described above. Selecting
Minimize WN variance [max lag=26] gives the graph of the data and
predictors shown in Figure 10-1. Right-clicking on the graph and then Info, we
find that the selected memory-shortening filter is $(1 - 0.9779B^{12})$. The fitted subset
autoregression and the coefficients $\xi_j$ of the overall whitening filter $\xi(B)$ are shown
below:

Optimal lags        1        3       12       13
Optimal coeffs      0.5915  -0.3822  -0.3022   0.2970
WN Variance: 0.12314E+06
COEFFICIENTS OF OVERALL WHITENING FILTER:
 1.0000  -0.5915   0.0000  -0.2093   0.0000
 0.0000   0.0000   0.0000   0.0000   0.0000
 0.0000   0.0000  -0.6757   0.2814   0.0000
 0.2047   0.0000   0.0000   0.0000   0.0000
 0.0000   0.0000   0.0000   0.0000  -0.2955
 0.2904

□

In  Table  10.1  we  compare  the  predictors  of  the  next  six  values  of  the  accidental 
deaths  series  with  the  actual  observed  values.  The  predicted  values  obtained  from 
ARAR  as  described  in  the  example  are  shown  together  with  the  predictors  obtained 
by fitting ARIMA models as described in Chapter 6 (see Table 10.1). The observed
root mean squared errors (i.e., $\sqrt{\sum_{h=1}^{6}(Y_{72+h} - P_{72}Y_{72+h})^2/6}$) for the three prediction
methods are easily calculated to be 253 for ARAR, 583 for the ARIMA model (6.5.8),
and  501  for  the  ARIMA  model  (6.5.9).  The  ARAR  algorithm  thus  performs  very 
well  here.  Notice  that  in  this  particular  example  the  ARAR  algorithm  effectively  fits 
a  causal  AR  model  to  the  data,  but  this  is  not  always  the  case. 
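
The three root mean squared errors quoted above can be reproduced from the entries of Table 10.1. The short Python check below uses only the tabulated values; the helper name `observed_rmse` is introduced here purely for illustration.

```python
import math

# Values for t = 73, ..., 78 taken from Table 10.1.
observed = [7798, 7406, 8363, 8460, 9217, 9316]
forecasts = {
    "ARAR": [8168, 7196, 7982, 8284, 9144, 9465],
    "ARIMA (6.5.8)": [8441, 7704, 8549, 8885, 9843, 10279],
    "ARIMA (6.5.9)": [8345, 7619, 8356, 8742, 9795, 10179],
}

def observed_rmse(actual, predicted):
    """Square root of the average squared forecast error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

for name, pred in forecasts.items():
    print(f"{name}: {observed_rmse(observed, pred):.0f}")
# prints ARAR: 253, ARIMA (6.5.8): 583, ARIMA (6.5.9): 501
```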

10.2 The Holt-Winters Algorithm

10.2.1 The Algorithm

Given observations $Y_1, Y_2, \ldots, Y_n$ from the “trend plus noise” model (1.5.2), the
exponential smoothing recursions (1.5.7) allowed us to compute estimates $\hat m_t$ of
the trend at times $t = 1, 2, \ldots, n$. If the series is stationary, then $m_t$ is constant and the
exponential smoothing forecast of $Y_{n+h}$ based on the observations $Y_1, \ldots, Y_n$ is

$$P_nY_{n+h} = \hat m_n, \quad h = 1, 2, \ldots. \tag{10.2.1}$$

If  the  data  have  a  (nonconstant)  trend,  then  a  natural  generalization  of  the  forecast 
function  (10.2.1)  that  takes  this  into  account  is 

$$P_nY_{n+h} = \hat a_n + \hat b_n h, \quad h = 1, 2, \ldots, \tag{10.2.2}$$

where $\hat a_n$ and $\hat b_n$ can be thought of as estimates of the “level” $a_n$ and “slope” $b_n$ of
the trend function at time n. Holt (1957) suggested a recursive scheme for computing
the quantities $\hat a_n$ and $\hat b_n$ in (10.2.2). Denoting by $\hat Y_{n+1}$ the one-step forecast $P_nY_{n+1}$, we
have from (10.2.2)

$$\hat Y_{n+1} = \hat a_n + \hat b_n.$$

Now, as in exponential smoothing, we suppose that the estimated level at time n + 1
is a linear combination of the observed value at time n + 1 and the forecast value at
time n + 1. Thus,

$$\hat a_{n+1} = \alpha Y_{n+1} + (1 - \alpha)(\hat a_n + \hat b_n). \tag{10.2.3}$$

We can then estimate the slope at time n + 1 as a linear combination of $\hat a_{n+1} - \hat a_n$ and
the estimated slope $\hat b_n$ at time n. Thus,

$$\hat b_{n+1} = \beta(\hat a_{n+1} - \hat a_n) + (1 - \beta)\hat b_n. \tag{10.2.4}$$

In  order  to  solve  the  recursions  (10.2.3)  and  (10.2.4)  we  need  initial  conditions. 
A  natural  choice  is  to  set 

$$\hat a_2 = Y_2 \tag{10.2.5}$$

and

$$\hat b_2 = Y_2 - Y_1. \tag{10.2.6}$$

Then (10.2.3) and (10.2.4) can be solved successively for $\hat a_i$ and $\hat b_i$, $i = 3, \ldots, n$, and
the predictors $P_nY_{n+h}$ found from (10.2.2).

Table 10.1 Predicted and observed values of the accidental deaths series for t = 73, ..., 78

t                      73     74     75     76     77     78
Observed Y_t           7798   7406   8363   8460   9217   9316
Predicted by ARAR      8168   7196   7982   8284   9144   9465
Predicted by (6.5.8)   8441   7704   8549   8885   9843   10,279
Predicted by (6.5.9)   8345   7619   8356   8742   9795   10,179

The forecasts depend on the “smoothing parameters” $\alpha$ and $\beta$. These can either
be prescribed arbitrarily (with values between 0 and 1) or chosen in a more systematic
way to minimize the sum of squares of the one-step errors $\sum_{i=3}^{n}(Y_i - P_{i-1}Y_i)^2$, obtained
when the algorithm is applied to the already observed data. Both choices are available
in the ITSM option Forecasting>Holt-Winters.
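
The recursions (10.2.3) and (10.2.4), the initial conditions (10.2.5) and (10.2.6), and the forecast function (10.2.2) can be sketched as follows. The function names are hypothetical, and the coarse grid search in `optimize_holt_winters` is only one simple way of minimizing the sum of squared one-step errors; it is not claimed to be the optimizer used by ITSM.

```python
def holt_winters(y, alpha, beta, h):
    """Return (forecasts P_n Y_{n+1..n+h}, sum of squared one-step errors)."""
    a, b = y[1], y[1] - y[0]                 # (10.2.5) and (10.2.6)
    sse = 0.0
    for obs in y[2:]:
        sse += (obs - (a + b)) ** 2          # one-step error Y_i - P_{i-1} Y_i
        a_new = alpha * obs + (1 - alpha) * (a + b)    # (10.2.3)
        b = beta * (a_new - a) + (1 - beta) * b        # (10.2.4)
        a = a_new
    return [a + b * k for k in range(1, h + 1)], sse   # (10.2.2)

def optimize_holt_winters(y, grid=None):
    """Grid search for (alpha, beta) minimizing the one-step sum of squares."""
    grid = grid or [i / 20 for i in range(1, 20)]
    return min((holt_winters(y, a, b, 1)[1], a, b) for a in grid for b in grid)[1:]
```

On an exactly linear series such as 3, 5, 7, 9, 11 the one-step errors vanish for any $\alpha$ and $\beta$, and the forecasts simply continue the line (13, 15, 17, ...).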

Before  illustrating  the  use  of  the  Holt-Winters  forecasting  procedure,  we  discuss 
the  connection  between  the  recursions  (10.2.3)  and  (10.2.4)  and  the  steady-state 
solution  of  the  Kalman  filtering  equations  for  a  local  linear  trend  model.  Suppose  {Yt} 
follows  the  local  linear  structural  model  with  observation  equation 

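$$Y_t = M_t + W_t$$

and state equation

$$\begin{bmatrix} M_{t+1} \\ B_{t+1} \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} M_t \\ B_t \end{bmatrix} + \begin{bmatrix} V_t \\ U_t \end{bmatrix}$$

[see (9.2.4)-(9.2.7)]. Now define $\hat a_n$ and $\hat b_n$ to be the filtered estimates of $M_n$ and $B_n$,
respectively, i.e.,

$$\hat a_n = M_{n|n} := P_nM_n, \qquad \hat b_n = B_{n|n} := P_nB_n.$$

Using Problem 9.17 and the Kalman recursion (9.4.16), we find that

$$\begin{bmatrix} \hat a_{n+1} \\ \hat b_{n+1} \end{bmatrix} = \begin{bmatrix} \hat a_n + \hat b_n \\ \hat b_n \end{bmatrix} + \Delta_n^{-1}\Omega_n G'\left(Y_{n+1} - \hat a_n - \hat b_n\right), \tag{10.2.7}$$

where $G = [1 \;\; 0]$. Assuming that $\Omega_n = \Omega_1 = [\Omega_{ij}]_{i,j=1}^{2}$ is the steady-state solution
of (9.4.2) for this model, then $\Delta_n = \Omega_{11} + \sigma_w^2$ for all n, so that (10.2.7) simplifies to
the equations

$$\hat a_{n+1} = \hat a_n + \hat b_n + \frac{\Omega_{11}}{\Omega_{11} + \sigma_w^2}\left(Y_{n+1} - \hat a_n - \hat b_n\right) \tag{10.2.8}$$

and

$$\hat b_{n+1} = \hat b_n + \frac{\Omega_{12}}{\Omega_{11} + \sigma_w^2}\left(Y_{n+1} - \hat a_n - \hat b_n\right). \tag{10.2.9}$$

Solving (10.2.8) for $(Y_{n+1} - \hat a_n - \hat b_n)$ and substituting into (10.2.9), we find that

$$\hat a_{n+1} = \alpha Y_{n+1} + (1 - \alpha)\left(\hat a_n + \hat b_n\right), \tag{10.2.10}$$

$$\hat b_{n+1} = \beta\left(\hat a_{n+1} - \hat a_n\right) + (1 - \beta)\hat b_n \tag{10.2.11}$$

with $\alpha = \Omega_{11}/(\Omega_{11} + \sigma_w^2)$ and $\beta = \Omega_{21}/\Omega_{11}$. These equations coincide with the
Holt-Winters recursions (10.2.3) and (10.2.4). Equations relating $\alpha$ and $\beta$ to the
variances $\sigma_u^2$, $\sigma_v^2$, and $\sigma_w^2$ can be found in Harvey (1990).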


Example 10.2.1 To predict 24 values of the accidental deaths series using the Holt-Winters algorithm,
open the file DEATHS.TSM and select Forecasting>Holt-Winters. In the
resulting dialog box specify 24 for the number of predictors and check the box marked
Optimize coefficients for automatic selection of the smoothing coefficients
$\alpha$ and $\beta$. Click OK, and the forecasts will be plotted with the original data as shown in
Figure 10-2. Right-click on the graph and then Info to see the numerical values of
the predictors, their root mean squared errors, and the optimal values of $\alpha$ and $\beta$. The
predicted and observed values are shown in Table 10.2.

□ 

The root mean squared error ($\sqrt{\sum_{h=1}^{6}(Y_{72+h} - P_{72}Y_{72+h})^2/6}$) for the nonseasonal
Holt-Winters forecasts is found to be 1143. Not surprisingly, since we have not taken
seasonality into account, this is a much larger value than for the three sets of forecasts
shown  in  Table  10.1.  In  the  next  section  we  show  how  to  modify  the  Holt-Winters 
algorithm  to  allow  for  seasonality. 


10.2.2 Holt-Winters and ARIMA Forecasting

The one-step forecasts obtained by exponential smoothing with parameter $\alpha$ (defined
by (1.5.7) and (10.2.1)) satisfy the relations

$$P_nY_{n+1} = Y_n - (1 - \alpha)(Y_n - P_{n-1}Y_n), \quad n \ge 2. \tag{10.2.12}$$

But these are the same relations satisfied by the large-sample minimum mean squared
error forecasts of the invertible ARIMA(0,1,1) process

$$Y_t = Y_{t-1} + Z_t - (1 - \alpha)Z_{t-1}, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma^2). \tag{10.2.13}$$

Forecasting by exponential smoothing with optimal $\alpha$ can therefore be viewed as fitting
a member of the two-parameter family of ARIMA processes (10.2.13) to the data and
using the corresponding large-sample forecast recursions initialized by $P_0Y_1 = Y_1$. In
ITSM, the optimal $\alpha$ is found by minimizing the average squared error of the one-step
forecasts of the observed data $Y_2, \ldots, Y_n$, and the parameter $\sigma^2$ is estimated by this
average  squared  error.  This  algorithm  could  easily  be  modified  to  minimize  other  error 
measures  such  as  average  absolute  one-step  error  and  average  12-step  squared  error. 
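
The relation between the smoothing recursion and the one-step forecasts is easy to check numerically. The sketch below assumes the usual form of the exponential smoothing recursion, $\hat m_1 = Y_1$ and $\hat m_t = \alpha Y_t + (1 - \alpha)\hat m_{t-1}$; the function names and the grid search over $\alpha$ are illustrative choices, not a description of ITSM.

```python
def exp_smoothing_forecasts(y, alpha):
    """Return [P_1 Y_2, ..., P_n Y_{n+1}], starting from m_1 = Y_1."""
    m = y[0]
    preds = [m]
    for obs in y[1:]:
        m = alpha * obs + (1 - alpha) * m
        preds.append(m)
    return preds

def mean_squared_one_step_error(y, alpha):
    """Average squared error of the one-step forecasts of Y_2, ..., Y_n."""
    preds = exp_smoothing_forecasts(y, alpha)
    errors = [y[i] - preds[i - 1] for i in range(1, len(y))]
    return sum(e * e for e in errors) / len(errors)

def best_alpha(y, grid=None):
    """Value of alpha on a grid minimizing the average squared one-step error."""
    grid = grid or [i / 100 for i in range(1, 100)]
    return min(grid, key=lambda a: mean_squared_one_step_error(y, a))
```

Replacing the squared error in `mean_squared_one_step_error` by an absolute error, or by a 12-step error, gives the alternative criteria mentioned above.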

In  the  same  way  it  can  be  shown  that  Holt-Winters  forecasting  can  be  viewed  as 
fitting  a  member  of  the  three-parameter  family  of  ARIMA  processes, 

$$(1 - B)^2 Y_t = Z_t - (2 - \alpha - \alpha\beta)Z_{t-1} + (1 - \alpha)Z_{t-2}, \tag{10.2.14}$$

where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$. The coefficients $\alpha$ and $\beta$ are selected as described after
(10.2.6), and the estimate of $\sigma^2$ is the average squared error of the one-step forecasts
of $Y_3, \ldots, Y_n$ obtained from the large-sample forecast recursions corresponding to
(10.2.14).


Figure 10-2 The data set DEATHS.TSM with 24 values predicted by the nonseasonal Holt-Winters algorithm

Table 10.2 Predicted and observed values of the accidental deaths series for t = 73, ..., 78 from the (nonseasonal) Holt-Winters algorithm

t                  73     74     75     76     77     78
Observed Y_t       7798   7406   8363   8460   9217   9316
Predicted by HW    9281   9322   9363   9404   9445   9486


10.3  The  Holt-Winters  Seasonal  Algorithm 


10.3.1 The Algorithm

If the series $Y_1, Y_2, \ldots, Y_n$ contains not only trend, but also seasonality with period d
[as in the model (1.5.11)], then a further generalization of the forecast function (10.2.2)
that takes this into account is

$$P_nY_{n+h} = \hat a_n + \hat b_n h + \hat c_{n+h}, \quad h = 1, 2, \ldots, \tag{10.3.1}$$

where $\hat a_n$, $\hat b_n$, and $\hat c_n$ can be thought of as estimates of the “trend level” $a_n$, “trend
slope” $b_n$, and “seasonal component” $c_n$ at time n. If k is the smallest integer such that
$n + h - kd \le n$, then we set

$$\hat c_{n+h} = \hat c_{n+h-kd}, \tag{10.3.2}$$

while the values of $\hat a_i$, $\hat b_i$, and $\hat c_i$, $i = d + 2, \ldots, n$, are found from recursions analogous
to (10.2.3) and (10.2.4), namely,

$$\hat a_{n+1} = \alpha\left(Y_{n+1} - \hat c_{n+1-d}\right) + (1 - \alpha)\left(\hat a_n + \hat b_n\right), \tag{10.3.3}$$

$$\hat b_{n+1} = \beta\left(\hat a_{n+1} - \hat a_n\right) + (1 - \beta)\hat b_n, \tag{10.3.4}$$

and

$$\hat c_{n+1} = \gamma\left(Y_{n+1} - \hat a_{n+1}\right) + (1 - \gamma)\hat c_{n+1-d}, \tag{10.3.5}$$

with initial conditions

$$\hat a_{d+1} = Y_{d+1}, \tag{10.3.6}$$

$$\hat b_{d+1} = (Y_{d+1} - Y_1)/d, \tag{10.3.7}$$

and

$$\hat c_i = Y_i - \left(Y_1 + \hat b_{d+1}(i - 1)\right), \quad i = 1, \ldots, d + 1. \tag{10.3.8}$$

Then (10.3.3)-(10.3.5) can be solved successively for $\hat a_i$, $\hat b_i$, and $\hat c_i$, $i = d + 1, \ldots, n$,
and the predictors $P_nY_{n+h}$ found from (10.3.1).

As in the nonseasonal case of Section 10.2, the forecasts depend on the parameters
$\alpha$, $\beta$, and $\gamma$. These can either be prescribed arbitrarily (with values between 0 and 1) or
chosen in a more systematic way to minimize the sum of squares of the one-step errors
$\sum_{i=d+2}^{n}(Y_i - P_{i-1}Y_i)^2$, obtained when the algorithm is applied to the already observed
data. Seasonal Holt-Winters forecasts can be computed by selecting the ITSM option
Forecasting>Seasonal Holt-Winters.
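
A minimal sketch of the seasonal recursions (10.3.3)-(10.3.5), with the initial conditions (10.3.6)-(10.3.8) and the forecast function (10.3.1)-(10.3.2), is given below. The function name is hypothetical and the code uses 0-based list indices, so that `y[t]` holds $Y_{t+1}$ and `c[t]` holds $\hat c_{t+1}$.

```python
def holt_winters_seasonal(y, d, alpha, beta, gamma, h):
    """Return (forecasts P_n Y_{n+1..n+h}, sum of squared one-step errors)."""
    n = len(y)
    a = y[d]                                            # (10.3.6)
    b = (y[d] - y[0]) / d                               # (10.3.7)
    c = [y[i] - (y[0] + b * i) for i in range(d + 1)]   # (10.3.8)
    sse = 0.0
    for t in range(d + 1, n):
        sse += (y[t] - (a + b + c[t - d])) ** 2         # one-step error
        a_new = alpha * (y[t] - c[t - d]) + (1 - alpha) * (a + b)   # (10.3.3)
        b = beta * (a_new - a) + (1 - beta) * b                      # (10.3.4)
        c.append(gamma * (y[t] - a_new) + (1 - gamma) * c[t - d])    # (10.3.5)
        a = a_new
    forecasts = []
    for k in range(1, h + 1):
        idx = n - 1 + k
        while idx > n - 1:       # (10.3.2): step back by whole periods
            idx -= d
        forecasts.append(a + b * k + c[idx])            # (10.3.1)
    return forecasts, sse
```

For a series that is exactly a straight line plus a component of period d, the one-step errors are zero for any choice of the three parameters, and the forecasts reproduce the pattern.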


Example 10.3.1 As in Example 10.2.1, open the file DEATHS.TSM, but this time select
Forecasting>Seasonal Holt-Winters. Specify 24 for the number of predicted
values required, 12 for the period of the seasonality, and check the box marked
Optimize Coefficients. Click OK, and the graph of the data and predicted
values shown in Figure 10-3 will appear. Right-click on the graph and then on Info
and you will see the numerical values of the predictors and the optimal values of the
coefficients $\alpha$, $\beta$, and $\gamma$ (minimizing the observed one-step average squared error
$\sum_{i=14}^{72}(Y_i - P_{i-1}Y_i)^2/59$). Table 10.3 compares the predictors of $Y_{73}, \ldots, Y_{78}$ with the
corresponding observed values.

□ 

The root mean squared error ($\sqrt{\sum_{h=1}^{6}(Y_{72+h} - P_{72}Y_{72+h})^2/6}$) for the seasonal
Holt-Winters forecasts is found to be 401. This is not as good as the value 253
achieved  by  the  ARAR  model  for  this  example  but  is  substantially  better  than  the 
values  achieved  by  the  nonseasonal  Holt-Winters  algorithm  (1143)  and  the  ARIMA 
models  (6.5.8)  and  (6.5.9)  (583  and  501,  respectively). 


10.3.2 Holt-Winters Seasonal and ARIMA Forecasting

As  in  Section  10.2.2,  the  Holt-Winters  seasonal  recursions  with  seasonal  period  d 
correspond  to  the  large-sample  forecast  recursions  of  an  ARIMA  process,  in  this 
case  defined  by 

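$$\begin{aligned}
(1 - B)(1 - B^d)Y_t = {} & Z_t + \cdots + Z_{t-d+1} + \gamma(1 - \alpha)(Z_{t-d} - Z_{t-d-1}) \\
& - (2 - \alpha - \alpha\beta)(Z_{t-1} + \cdots + Z_{t-d}) \\
& + (1 - \alpha)(Z_{t-2} + \cdots + Z_{t-d-1}),
\end{aligned}$$

where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$. Holt-Winters seasonal forecasting with optimal $\alpha$, $\beta$, and $\gamma$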
can  therefore  be  viewed  as  fitting  a  member  of  this  four-parameter  family  of  ARIMA 
models  and  using  the  corresponding  large-sample  forecast  recursions. 


Table 10.3 Predicted and observed values of the accidental deaths series for t = 73, ..., 78 from the seasonal Holt-Winters algorithm

t                   73     74     75     76     77     78
Observed Y_t        7798   7406   8363   8460   9217   9316
Predicted by HWS    8039   7077   7750   7941   8824   9329


10.4 Choosing a Forecasting Algorithm

Real  data  are  rarely  if  ever  generated  by  a  simple  mathematical  model  such  as  an 
ARIMA  process.  Forecasting  methods  that  are  predicated  on  the  assumption  of  such  a 
model  are  therefore  not  necessarily  the  best,  even  in  the  mean  squared  error  sense.  Nor 
is  the  measurement  of  error  in  terms  of  mean  squared  error  necessarily  always  the  most 
appropriate  one  in  spite  of  its  mathematical  convenience.  Even  within  the  framework 
of  minimum  mean  squared-error  forecasting,  we  may  ask  (for  example)  whether  we 
wish to minimize the one-step, two-step, or twelve-step mean squared error.

Figure 10-3 The data set DEATHS.TSM with 24 values predicted by the seasonal Holt-Winters algorithm


The  use  of  more  heuristic  algorithms  such  as  those  discussed  in  this  chapter 
is  therefore  well  worth  serious  consideration  in  practical  forecasting  problems.  But 
how  do  we  decide  which  method  to  use?  A  relatively  simple  solution  to  this  problem, 
given  the  availability  of  a  substantial  historical  record,  is  to  choose  among  competing 
algorithms  by  comparing  the  relevant  errors  when  the  algorithms  are  applied  to  the 
data  already  observed  (e.g.,  by  comparing  the  mean  absolute  percentage  errors  of  the 
12-step  predictors  of  the  historical  data  if  12-step  prediction  is  of  primary  concern). 
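
A comparison of this kind can be organized as a rolling evaluation over the historical record. In the sketch below the function `rolling_mape` and the two stand-in forecasters (`naive` and `seasonal_naive`) are hypothetical illustrations; in practice the forecaster argument would wrap whichever of the competing algorithms is under consideration, and the series is assumed to have no zero values.

```python
def rolling_mape(y, forecaster, h, min_train):
    """Mean absolute percentage error of h-step predictors over the record.

    forecaster(history, h) must return the predictor of the value h steps
    beyond the end of history.
    """
    errors = []
    for n in range(min_train, len(y) - h + 1):
        pred = forecaster(y[:n], h)          # based on Y_1, ..., Y_n only
        actual = y[n + h - 1]                # Y_{n+h}
        errors.append(abs((actual - pred) / actual))
    return 100.0 * sum(errors) / len(errors)

def naive(history, h):
    """Stand-in forecaster: repeat the last observed value."""
    return history[-1]

def seasonal_naive(history, h, d=12):
    """Stand-in forecaster: repeat the value from the same season."""
    k = -(-h // d)                           # smallest k with h - k*d <= 0
    return history[len(history) + h - k * d - 1]
```

The method with the smallest value of `rolling_mape` for the horizon of interest would then be preferred, with the usual caveat that the ranking depends on the series and on the error measure chosen.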

It  is  extremely  difficult  to  make  general  theoretical  statements  about  the  relative 
merits  of  the  various  techniques  we  have  discussed  (ARIMA  modeling,  exponential 
smoothing, ARAR, and HW methods). For the series DEATHS.TSM we found on
the basis of average mean squared error for predicting the series at times 73-78
that the ARAR method was best, followed by the seasonal Holt-Winters algorithm,
and then the ARIMA models fitted in Chapter 6. This ordering is by no means
universal. For example, if we consider the natural logarithms $\{Y_t\}$ of the first 130
observations in the series WINE.TSM (Figure 1-1) and compare the average mean
squared errors of the forecasts of $Y_{131}, \ldots, Y_{142}$, we find (Problem 10.2) that an MA(12)
model fitted to the mean-corrected differenced series $\{Y_t - Y_{t-12}\}$ does better than
seasonal  Holt-Winters  (with  period  12),  which  in  turn  does  better  than  ARAR  and 
(not  surprisingly)  dramatically  better  than  nonseasonal  Holt-Winters.  An  interesting 
empirical  comparison  of  these  and  other  methods  applied  to  a  variety  of  economic  time 
series  is  contained  in  Makridakis  et  al.  (1984). 

The  versions  of  the  Holt-Winters  algorithms  we  have  discussed  in  Sections  10.2 
and  10.3  are  referred  to  as  “additive,”  since  the  seasonal  and  trend  components  enter  the 
forecasting  function  in  an  additive  manner.  “Multiplicative”  versions  of  the  algorithms 
can  also  be  constructed  to  deal  directly  with  processes  of  the  form 

$$Y_t = m_t s_t Z_t, \tag{10.4.1}$$

where $m_t$, $s_t$, and $Z_t$ are trend, seasonal, and noise factors, respectively (see, e.g.,
Makridakis et al. 1997). An alternative approach (provided that $Y_t > 0$ for all t) is
to apply the linear Holt-Winters algorithms to $\{\ln Y_t\}$ (as in the case of WINE.TSM in
the preceding paragraph). Because of the rather general memory shortening permitted
by the ARAR algorithm, it gives reasonable results when applied directly to series
of the form (10.4.1), even without preliminary transformations. In particular, if we
consider the first 132 observations in the series AIRPASS.TSM and apply the ARAR
algorithm to predict the last 12 values in the series, we obtain (Problem 10.4) an
observed root mean squared error of 18.22. On the other hand, if we use the same
data, take logarithms, difference at lag 12, subtract the mean, and then fit an AR(13)
model by maximum likelihood using ITSM and use it to predict the last 12 values, we
obtain an observed root mean squared error of 21.17. The data and predicted values
from the ARAR algorithm are shown in Figure 10-4.

Figure 10-4 The first 132 values of the data set AIRPASS.TSM and predictors of the last 12 values obtained by direct application of the ARAR algorithm


Problems 


10.1 Establish the formula (10.1.8) for the mean squared error of the h-step forecast
based on the ARAR algorithm.

10.2 Let $\{X_1, \ldots, X_{142}\}$ denote the data in the file WINE.TSM and let $\{Y_1, \ldots, Y_{142}\}$
denote their natural logarithms. Denote by m the sample mean of the differenced
series $\{Y_t - Y_{t-12}, t = 13, \ldots, 130\}$.

(a) Use the program ITSM to find the maximum likelihood MA(12) model for
the differenced and mean-corrected series $\{Y_t - Y_{t-12} - m, t = 13, \ldots, 130\}$.

(b) Use the model in (a) to compute forecasts of $\{X_{131}, \ldots, X_{142}\}$.

(c) Tabulate the forecast errors $\{X_t - P_{130}X_t, t = 131, \ldots, 142\}$.

(d) Compute the average squared error for the 12 forecasts.

(e) Repeat steps (b), (c), and (d) for the corresponding forecasts obtained by
applying the ARAR algorithm to the series $\{X_t, t = 1, \ldots, 130\}$.

(f) Repeat steps (b), (c), and (d) for the corresponding forecasts obtained by
applying the seasonal Holt-Winters algorithm (with period 12) to the
logged data $\{Y_t, t = 1, \ldots, 130\}$. (Open the file WINE.TSM, select
Transform>Box-Cox with parameter $\lambda = 0$, then select
Forecasting>Seasonal Holt-Winters, and check Apply to original data
in the dialog box.)




(g) Repeat steps (b), (c), and (d) for the corresponding forecasts obtained by
applying the nonseasonal Holt-Winters algorithm to the logged data
$\{Y_t, t = 1, \ldots, 130\}$. (The procedure is analogous to that described in part (f).)

(h) Compare the average squared errors obtained by the four methods.

10.3 In equations (10.2.10) and (10.2.11), show that $\alpha = \Omega_{11}/(\Omega_{11} + \sigma_w^2)$ and
$\beta = \Omega_{21}/\Omega_{11}$.

10.4  Verify  the  assertions  made  in  the  last  paragraph  of  Section  10.4,  comparing  the 
forecasts  of  the  last  12  values  of  the  series  AIRPASS.TSM  obtained  from  the 
ARAR  algorithm  (with  no  log  transformation)  and  the  corresponding  forecasts 
obtained  by  taking  logarithms  of  the  original  series,  then  differencing  at  lag  12, 
mean-correcting, and fitting an AR(13) model to the transformed series.

11.1 Transfer Function Models

11.2 Intervention Analysis

11.3 Nonlinear Models

11.4 Long-Memory Models


In  this  chapter  we  touch  on  a  variety  of  topics  of  special  interest.  In  Section  1 1.1 
we  consider  transfer  function  models,  designed  to  exploit  for  predictive  purposes  the 
relationship  between  two  time  series  when  one  acts  as  a  leading  indicator  for  the  other. 
Section  11.2  deals  with  intervention  analysis,  which  allows  for  possible  changes  in 
the  mechanism  generating  a  time  series,  causing  it  to  have  different  properties  over 
different  time  intervals.  In  Section  11.3  we  introduce  the  very  fast  growing  area  of 
nonlinear  time  series  analysis,  and  in  Section  11.4  we  discuss  fractionally  integrated 
ARMA  processes,  sometimes  called  “long-memory”  processes  on  account  of  the  slow 
rate  of  convergence  of  their  autocorrelation  functions  to  zero  as  the  lag  increases.  In 
Section  11.5  we  discuss  continuous-time  ARMA  processes  which,  for  continuously 
evolving  processes,  play  a  role  analogous  to  that  of  ARMA  processes  in  discrete  time. 
Besides  being  of  interest  in  their  own  right,  they  have  proved  a  useful  class  of  models 
in  the  representation  of  financial  time  series  and  in  the  modeling  of  irregularly  spaced 
data. 


11.1  Transfer  Function  Models 


In this section we consider the problem of estimating the transfer function of a linear
filter when the output includes added uncorrelated noise. Suppose that $\{X_{t1}\}$ and $\{X_{t2}\}$
are, respectively, the input and output of the transfer function model

$$X_{t2} = \sum_{j=0}^{\infty} \tau_j X_{t-j,1} + N_t, \tag{11.1.1}$$

where $T = \{\tau_j, j = 0, 1, \ldots\}$ is a causal time-invariant linear filter and $\{N_t\}$ is a
zero-mean stationary process, uncorrelated with the input process $\{X_{t1}\}$. We further
assume that $\{X_{t1}\}$ is a zero-mean stationary time series. Then the bivariate process
$\{(X_{t1}, X_{t2})'\}$ is also stationary. Multiplying each side of (11.1.1) by $X_{t-k,1}$ and then
taking expectations gives the equation

$$\gamma_{21}(k) = \sum_{j=0}^{\infty} \tau_j \gamma_{11}(k - j). \tag{11.1.2}$$

Chapter 11 Further Topics
© Springer International Publishing Switzerland 2016
P.J. Brockwell, R.A. Davis, Introduction to Time Series and Forecasting,
Springer Texts in Statistics, DOI 10.1007/978-3-319-29854-2_11

Equation (11.1.2) simplifies a great deal if the input process happens to be white
noise. For example, if $\{X_{t1}\} \sim \mathrm{WN}(0, \sigma_1^2)$, then we can immediately identify $\tau_k$ from
(11.1.2) as

$$\tau_k = \gamma_{21}(k)/\sigma_1^2. \tag{11.1.3}$$

This observation suggests that “prewhitening” of the input process might simplify the
identification of an appropriate transfer function model and at the same time provide
simple preliminary estimates of the coefficients $\tau_k$.

If $\{X_{t1}\}$ can be represented as an invertible ARMA($p$, $q$) process

$$\phi(B)X_{t1} = \theta(B)Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, \sigma_Z^2), \qquad (11.1.4)$$

then application of the filter $\pi(B) = \phi(B)\theta^{-1}(B)$ to $\{X_{t1}\}$ will produce the whitened series $\{Z_t\}$. Now applying the operator $\pi(B)$ to each side of (11.1.1) and letting $Y_t = \pi(B)X_{t2}$, we obtain the relation

$$Y_t = \sum_{j=0}^{\infty} \tau_j Z_{t-j} + N_t',$$

where

$$N_t' = \pi(B)N_t,$$

and $\{N_t'\}$ is a zero-mean stationary process, uncorrelated with $\{Z_t\}$. The same arguments that led to (11.1.3) therefore yield the equation

$$\tau_j = \rho_{YZ}(j)\sigma_Y/\sigma_Z, \qquad (11.1.5)$$

where $\rho_{YZ}$ is the cross-correlation function of $\{Y_t\}$ and $\{Z_t\}$, $\sigma_Z^2 = \mathrm{Var}(Z_t)$, and $\sigma_Y^2 = \mathrm{Var}(Y_t)$.
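
The prewhitening relation (11.1.5) can be illustrated by simulation. The sketch below assumes an MA(1) input whose coefficient is treated as already known (in practice it would come from an ARMA fit), a geometrically decaying hypothetical filter with delay 3, and white noise for $\{N_t\}$. The helpers `inverse_ma1` and `cross_corr` and all numerical values are assumptions made for illustration; they are not ITSM functions and the data are not the sales series.

```python
import numpy as np

def inverse_ma1(x, theta):
    """Apply (1 + theta*B)^{-1} to x, i.e. solve z_t = x_t - theta * z_{t-1}."""
    z = np.empty(len(x))
    prev = 0.0
    for t, xt in enumerate(x):
        prev = xt - theta * prev
        z[t] = prev
    return z

def cross_corr(y, z, h):
    """Sample cross-correlation between y_{t+h} and z_t for a lag h >= 0."""
    y = y - y.mean()
    z = z - z.mean()
    n = len(y)
    return float(np.dot(y[h:], z[:n - h]) / (n * y.std() * z.std()))

rng = np.random.default_rng(1)
n, theta = 5000, -0.474                        # hypothetical size; X_t1 = (1 + theta*B) Z_t
z = rng.normal(0.0, np.sqrt(0.0779), n)
x1 = z.copy()
x1[1:] += theta * z[:-1]                       # MA(1) input series

tau_true = np.zeros(30)
tau_true[3:] = 4.86 * 0.7 ** np.arange(27)     # delay 3, geometric decay (assumed)
x2 = np.convolve(x1, tau_true)[:n] + rng.normal(0.0, np.sqrt(0.06), n)

z_hat = inverse_ma1(x1, theta)                 # whitened input  pi(B) X_t1
y = inverse_ma1(x2, theta)                     # filtered output pi(B) X_t2

# (11.1.5): tau_j = rho_YZ(j) * sigma_Y / sigma_Z
rho = np.array([cross_corr(y, z_hat, h) for h in range(8)])
tau_hat = rho * y.std() / z_hat.std()
print(np.round(tau_hat, 2))
```

The first three estimates should be near zero and the remaining ones near 4.86, 3.40, 2.38, and so on, reflecting the delay and the geometric decay built into the simulated filter.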

Given the observations $\{(X_{t1}, X_{t2})', t = 1, \ldots, n\}$, the results of the previous paragraph suggest the following procedure for estimating $\{\tau_j\}$ and analyzing the noise $\{N_t\}$ in the model (11.1.1):

1. Fit an ARMA model to $\{X_{t1}\}$ and file the residuals $(\hat Z_1, \ldots, \hat Z_n)$ (using the Export button in ITSM to copy them to the clipboard and then pasting them into the first column of an Excel file). Let $\hat\phi$ and $\hat\theta$ denote the maximum likelihood estimates of the autoregressive and moving-average parameters and let $\hat\sigma_Z^2$ be the maximum likelihood estimate of the variance of $\{Z_t\}$.

2. Apply the operator $\hat\pi(B) = \hat\phi(B)\hat\theta^{-1}(B)$ to $\{X_{t2}\}$ to obtain the series $(\hat Y_1, \ldots, \hat Y_n)$. (After fitting the ARMA model as in Step 1 above, highlight the window containing the graph of $\{X_t\}$ and replace $\{X_t\}$ by $\{Y_t\}$ using the option File>Import. The residuals are then automatically replaced by the residuals of $\{Y_t\}$ under the model already fitted to $\{X_t\}$.) Export the new residuals to the clipboard, paste them into the second column of the Excel file created in Step 1, and save this as a text file, FNAME.TSM. The file FNAME.TSM then contains the bivariate series $\{(\hat Z_t, \hat Y_t)\}$. Let $\hat\sigma_Y^2$ denote the sample variance of $\hat Y_t$.

3. Compute the sample auto- and cross-correlation functions of $\{\hat Z_t\}$ and $\{\hat Y_t\}$ by opening the bivariate project FNAME.TSM in ITSM and clicking on the second yellow button at the top of the ITSM window. Comparison of $\hat\rho_{YZ}(h)$ with the bounds $\pm 1.96n^{-1/2}$ gives a preliminary indication of the lags $h$ at which $\rho_{YZ}(h)$ is significantly different from zero. A more refined check can be carried out by using Bartlett's formula in Section 8.3.4 for the asymptotic variance of $\hat\rho_{YZ}(h)$. Under the assumptions that $\{\hat Z_t\} \sim \mathrm{WN}(0, \hat\sigma_Z^2)$ and $\{(\hat Y_t, \hat Z_t)'\}$ is a stationary Gaussian process,

$$n\mathrm{Var}(\hat\rho_{YZ}(h)) \sim 1 - \rho_{YZ}^2(h)\left[1.5 - \sum_{k=-\infty}^{\infty}\left(\rho_{YZ}^2(k) + \rho_{YY}^2(k)/2\right)\right] + \sum_{k=-\infty}^{\infty}\left[\rho_{YZ}(h+k)\rho_{YZ}(h-k) - 2\rho_{YZ}(h)\rho_{YZ}(k+h)\rho_{YY}(k)\right].$$

In order to check the hypothesis $H_0$ that $\rho_{YZ}(h) = 0$, $h \notin [a, b]$, where $a$ and $b$ are integers, we note from Corollary 8.3.1 that under $H_0$,

$$\mathrm{Var}(\hat\rho_{YZ}(h)) \sim n^{-1} \quad \text{for } h \notin [a, b].$$

We can therefore check the hypothesis $H_0$ by comparing $\hat\rho_{YZ}(h)$, $h \notin [a, b]$, with the bounds $\pm 1.96n^{-1/2}$. Observe that $\rho_{ZY}(h)$ should be zero for $h > 0$ if the model (11.1.1) is valid.

4. Preliminary estimates of $\tau_h$ for the lags $h$ at which $\hat\rho_{YZ}(h)$ is significantly different from zero are

$$\hat\tau_h = \hat\rho_{YZ}(h)\hat\sigma_Y/\hat\sigma_Z.$$

For other values of $h$ the preliminary estimates are $\hat\tau_h = 0$. The numerical values of the cross-correlations $\hat\rho_{YZ}(h)$ are found by right-clicking on the graphs of the sample correlations plotted in Step 3 and then on Info. The values of $\hat\sigma_Z$ and $\hat\sigma_Y$ are found by doing the same with the graphs of the series themselves. Let $m \ge 0$ be the largest value of $j$ such that $\hat\tau_j$ is nonzero and let $b \ge 0$ be the smallest such value. Then $b$ is known as the delay parameter of the filter $\{\hat\tau_j\}$. If $m$ is very large and if the coefficients $\{\hat\tau_j\}$ are approximately related by difference equations of the form

$$\hat\tau_j - v_1\hat\tau_{j-1} - \cdots - v_p\hat\tau_{j-p} = 0, \quad j \ge b + p,$$

then $\hat T(B) = \sum_{j=b}^{m}\hat\tau_jB^j$ can be represented approximately, using fewer parameters, as

$$\hat T(B) = w_0(1 - v_1B - \cdots - v_pB^p)^{-1}B^b.$$

In particular, if $\hat\tau_j = 0$, $j < b$, and $\hat\tau_j = w_0v_1^{j-b}$, $j \ge b$, then

$$\hat T(B) = w_0(1 - v_1B)^{-1}B^b. \qquad (11.1.6)$$

Box and Jenkins (1976) recommend choosing $\hat T(B)$ to be a ratio of two polynomials. However, the degrees of the polynomials are often difficult to estimate from $\{\hat\tau_j\}$. The primary objective at this stage is to find a parametric function that provides an adequate approximation to $\hat T(B)$ without introducing too large a number of parameters. If $\hat T(B)$ is represented as $\hat T(B) = B^bw(B)v^{-1}(B) = B^b(w_0 + w_1B + \cdots + w_qB^q)(1 - v_1B - \cdots - v_pB^p)^{-1}$ with $v(z) \ne 0$ for $|z| \le 1$, then we define $m = \max(q + b, p)$.

5. The noise sequence $\{N_t, t = m + 1, \ldots, n\}$ is estimated as

$$\hat N_t = X_{t2} - \hat T(B)X_{t1}.$$

(We set $\hat N_t = 0$, $t \le m$, in order to compute $\hat N_t$, $t > m = \max(b + q, p)$.) The calculations are done in ITSM by opening the bivariate file containing $\{(X_{t1}, X_{t2})\}$, selecting Transfer>Specify Model, and entering the preliminary model found in Step 4. Click on the fourth green button to see a graph of the residuals $\{\hat N_t\}$. These should then be filed as, say, NOISE.TSM.

6. Preliminary identification of a suitable model for the noise sequence is carried out by fitting a causal invertible ARMA model

$$\phi^{(N)}(B)N_t = \theta^{(N)}(B)W_t, \quad \{W_t\} \sim \mathrm{WN}(0, \sigma_W^2), \qquad (11.1.7)$$

to the estimated noise $\hat N_{m+1}, \ldots, \hat N_n$ filed as NOISE.TSM in Step 5.

7. At this stage we have the preliminary model

$$\phi^{(N)}(B)v(B)X_{t2} = B^b\phi^{(N)}(B)w(B)X_{t1} + \theta^{(N)}(B)v(B)W_t,$$

where $\hat T(B) = B^bw(B)v^{-1}(B)$ as in step (4). For this model we can compute $\hat W_t(w, v, \phi^{(N)}, \theta^{(N)})$, $t > m^* = \max(p_2 + p, b + p_2 + q)$, where $p_2$ is the degree of $\phi^{(N)}$, by setting $\hat W_t = 0$ for $t \le m^*$. The parameters $w$, $v$, $\phi^{(N)}$, and $\theta^{(N)}$ can then be reestimated (more efficiently) by minimizing the sum of squares

$$\sum_{t=m^*+1}^{n}\hat W_t^2(w, v, \phi^{(N)}, \theta^{(N)}).$$

(The calculations are performed in ITSM by opening the bivariate project $\{(X_{t1}, X_{t2})\}$, selecting Transfer>Specify Model, entering the preliminary model, and clicking OK. Then choose Transfer>Estimation, click OK, and the least squares estimates of the parameters will be computed. Pressing the fourth green button at the top of the screen will give a graph of the estimated residuals $\hat W_t$.)

8. To test for goodness of fit, the estimated residuals $\{\hat W_t, t > m^*\}$ and $\{\hat Z_t, t > m^*\}$ should be filed as a bivariate series and the auto- and cross-correlations compared with the bounds $\pm 1.96/\sqrt{n}$ in order to check the hypothesis that the two series are uncorrelated white noise sequences. Alternative models can be compared using the AICC value that is printed with the estimated parameters in Step 7. It is computed from the exact Gaussian likelihood, which is computed using a state-space representation of the model, described in Brockwell and Davis (1991), Section 13.1.

Example 11.1.1  Sales with a Leading Indicator

In this example we fit a transfer function model to the bivariate time series of Example 8.1.2. Let

$$X_{t1} = (1 - B)Y_{t1} - 0.0228, \quad t = 2, \ldots, 150,$$

$$X_{t2} = (1 - B)Y_{t2} - 0.420, \quad t = 2, \ldots, 150,$$

where $\{Y_{t1}\}$ and $\{Y_{t2}\}$, $t = 1, \ldots, 150$, are the leading indicator and sales data, respectively. It was found in Example 8.1.2 that $\{X_{t1}\}$ and $\{X_{t2}\}$ can be modeled as low-order zero-mean ARMA processes. In particular, we fitted the model

$$X_{t1} = (1 - 0.474B)Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, 0.0779),$$

to the series $\{X_{t1}\}$. We can therefore whiten the series by application of the filter $\pi(B) = (1 - 0.474B)^{-1}$. Applying $\pi(B)$ to both $\{X_{t1}\}$ and $\{X_{t2}\}$ we obtain

$$\hat Z_t = (1 - 0.474B)^{-1}X_{t1}, \quad \hat\sigma_Z^2 = 0.0779,$$

$$\hat Y_t = (1 - 0.474B)^{-1}X_{t2}, \quad \hat\sigma_Y^2 = 4.0217.$$

These calculations and the filing of the series $\{\hat Z_t\}$ and $\{\hat Y_t\}$ were carried out using ITSM as described in steps (1) and (2). Their sample auto- and cross-correlations, found as described in step (3), are shown in Figure 11-1. The cross-correlations $\hat\rho_{ZY}(h)$ (top right) and $\hat\rho_{YZ}(h)$ (bottom left), when compared with the upper and lower bounds $\pm 1.96(149)^{-1/2} = \pm 0.161$, strongly suggest a transfer function model for $\{X_{t2}\}$ in terms of $\{X_{t1}\}$ with delay parameter 3. Since $\hat\tau_j = \hat\rho_{YZ}(j)\hat\sigma_Y/\hat\sigma_Z$ is decreasing approximately geometrically for $j \ge 3$, we take $T(B)$ to have the form (11.1.6), i.e.,

$$T(B) = w_0(1 - v_1B)^{-1}B^3.$$

The preliminary estimates of $w_0$ and $v_1$ are $\hat w_0 = \hat\tau_3 = 4.86$ and $\hat v_1 = \hat\tau_4/\hat\tau_3 = 0.698$, the coefficients $\hat\tau_j$ being estimated as described in step (4). The estimated noise sequence is determined and filed using ITSM as described in step (5). It satisfies the equations

$$\hat N_t = X_{t2} - 4.86B^3(1 - 0.698B)^{-1}X_{t1}, \quad t = 5, 6, \ldots, 150.$$

Analysis of this univariate series with ITSM gives the MA(1) model

$$N_t = (1 - 0.364B)W_t, \quad \{W_t\} \sim \mathrm{WN}(0, 0.0590).$$

Substituting these preliminary noise and transfer function models into equation (11.1.1) then gives

$$X_{t2} = 4.86B^3(1 - 0.698B)^{-1}X_{t1} + (1 - 0.364B)W_t, \quad \{W_t\} \sim \mathrm{WN}(0, 0.0590).$$

Now minimizing the sum of squares in step (7) with respect to the parameters $(w_0, v_1, \theta_1^{(N)})$, we obtain the least squares model

$$X_{t2} = 4.717B^3(1 - 0.724B)^{-1}X_{t1} + (1 - 0.582B)W_t, \qquad (11.1.8)$$

where $\{W_t\} \sim \mathrm{WN}(0, 0.0486)$ and

$$X_{t1} = (1 - 0.474B)Z_t, \quad \{Z_t\} \sim \mathrm{WN}(0, 0.0779).$$

Notice the reduced white noise variance of $\{W_t\}$ in the least squares model as compared with the preliminary model.

The sample auto- and cross-correlation functions of the series $\hat Z_t$ and $\hat W_t$, $t = 5, \ldots, 150$, are shown in Figure 11-2. All of the correlations lie between the bounds $\pm 1.96/\sqrt{144}$, supporting the assumption underlying the fitted model that the residuals are uncorrelated white noise sequences.

□ 


11.1.1  Prediction  Based  on  a  Transfer  Function  Model 

When predicting $X_{n+h,2}$ on the basis of the transfer function model defined by (11.1.1), (11.1.4), and (11.1.7), with observations of $X_{t1}$ and $X_{t2}$, $t = 1, \ldots, n$, our aim is to find the linear combination of $1, X_{11}, \ldots, X_{n1}, X_{12}, \ldots, X_{n2}$ that predicts $X_{n+h,2}$ with minimum mean squared error. The exact solution of this problem can be found with the help of the Kalman recursions (see Brockwell and Davis (1991), Section 13.1 for details). The program ITSM uses these recursions to compute the predictors and their mean squared errors.

In order to provide a little more insight, we give here the predictors $P_nX_{n+h,2}$ and mean squared errors based on infinitely many past observations $X_{t1}$ and $X_{t2}$, $-\infty < t \le n$. These predictors and their mean squared errors will be close to those based on $X_{t1}$ and $X_{t2}$, $1 \le t \le n$, if $n$ is sufficiently large.

Figure 11-1. The sample correlation functions of Example 11.1.1. Series 1 is $\{\hat Z_t\}$ and Series 2 is $\{\hat Y_t\}$.

The transfer function model defined by (11.1.1), (11.1.4), and (11.1.7) can be rewritten as

$$X_{t2} = T(B)X_{t1} + \beta(B)W_t, \qquad (11.1.9)$$

$$X_{t1} = \theta(B)\phi^{-1}(B)Z_t, \qquad (11.1.10)$$

where $\beta(B) = \theta^{(N)}(B)/\phi^{(N)}(B)$. Eliminating $X_{t1}$ gives

$$X_{t2} = \sum_{j=0}^{\infty}\alpha_jZ_{t-j} + \sum_{j=0}^{\infty}\beta_jW_{t-j}, \qquad (11.1.11)$$

where $\alpha(B) = T(B)\theta(B)/\phi(B)$.

Noting that each limit of linear combinations of $\{X_{t1}, X_{t2}, -\infty < t \le n\}$ is a limit of linear combinations of $\{Z_t, W_t, -\infty < t \le n\}$ and conversely and that $\{Z_t\}$ and $\{W_t\}$ are uncorrelated, we see at once from (11.1.11) that

$$P_nX_{n+h,2} = \sum_{j=h}^{\infty}\alpha_jZ_{n+h-j} + \sum_{j=h}^{\infty}\beta_jW_{n+h-j}. \qquad (11.1.12)$$

Setting $t = n + h$ in (11.1.11) and subtracting (11.1.12) gives the mean squared error

$$E(X_{n+h,2} - P_nX_{n+h,2})^2 = \sigma_Z^2\sum_{j=0}^{h-1}\alpha_j^2 + \sigma_W^2\sum_{j=0}^{h-1}\beta_j^2. \qquad (11.1.13)$$

To compute the predictors $P_nX_{n+h,2}$ we proceed as follows. Rewrite (11.1.9) as

$$A(B)X_{t2} = B^bU(B)X_{t1} + V(B)W_t, \qquad (11.1.14)$$

where $A$, $U$, and $V$ are polynomials of the form

$$A(B) = 1 - A_1B - \cdots - A_aB^a,$$

$$U(B) = U_0 + U_1B + \cdots + U_uB^u,$$

$$V(B) = 1 + V_1B + \cdots + V_vB^v.$$

Applying the operator $P_n$ to equation (11.1.14) with $t = n + h$, we obtain

$$P_nX_{n+h,2} = \sum_{j=1}^{a}A_jP_nX_{n+h-j,2} + \sum_{j=0}^{u}U_jP_nX_{n+h-b-j,1} + \sum_{j=h}^{v}V_jW_{n+h-j}, \qquad (11.1.15)$$

where the last sum is zero if $h > v$.

Since $\{X_{t1}\}$ is uncorrelated with $\{W_t\}$, the predictors appearing in the second sum in (11.1.15) are obtained by predicting the univariate series $\{X_{t1}\}$ as described in Section 3.3 using the model (11.1.10). In keeping with our assumption that $n$ is large, we can replace $P_nX_{j1}$ for each $j$ by the finite-past predictor obtained from the program ITSM. The values $W_j$, $j \le n$, are replaced by their estimated values $\hat W_j$ from the least squares estimation in step (7) of the modeling procedure.

Equation (11.1.15) can now be solved recursively for the predictors $P_nX_{n+1,2}, P_nX_{n+2,2}, P_nX_{n+3,2}, \ldots$.

Figure 11-2. The sample correlation functions of the estimated residuals from the model fitted in Example 11.1.1. Series 1 is $\{\hat Z_t\}$ and Series 2 is $\{\hat W_t\}$.

Example 11.1.2  Sales with a Leading Indicator

Applying the preceding results to the series $\{X_{t1}, X_{t2}, 2 \le t \le 150\}$ of Example 11.1.1, and using the values $X_{148,1} = -0.093$, $X_{150,2} = 0.08$, $\hat W_{150} = -0.0706$, $\hat W_{149} = 0.1449$, we find from (11.1.8) and (11.1.15) that

$$P_{150}X_{151,2} = 0.724X_{150,2} + 4.717X_{148,1} - 1.306\hat W_{150} + 0.421\hat W_{149} = -0.228$$

and, using the value $X_{149,1} = 0.237$, that

$$P_{150}X_{152,2} = 0.724P_{150}X_{151,2} + 4.717X_{149,1} + 0.421\hat W_{150} = 0.923.$$

In terms of the original sales data $\{Y_{t2}\}$ we have $Y_{150,2} = 262.7$ and

$$Y_{t2} = Y_{t-1,2} + X_{t2} + 0.420.$$

Hence the predictors of actual sales are

$$P^*_{150}Y_{151,2} = 262.70 - 0.228 + 0.420 = 262.89,$$

$$P^*_{150}Y_{152,2} = 262.89 + 0.923 + 0.420 = 264.23,$$

where $P^*_{150}$ is based on $\{1, Y_{11}, Y_{12}, X_{s1}, X_{s2}, -\infty < s \le 150\}$, and it is assumed that $Y_{11}$ and $Y_{12}$ are uncorrelated with $\{X_{s1}\}$ and with $\{X_{s2}\}$. The predicted values are in close agreement with those based on the finite number of available observations that are computed by ITSM. Since our model for the sales data is

$$(1 - B)Y_{t2} = 0.420 + 4.717B^3(1 - 0.474B)(1 - 0.724B)^{-1}Z_t + (1 - 0.582B)W_t,$$

it can be shown, using an argument analogous to that which gave (11.1.13), that the mean squared errors are given by

$$E(Y_{150+h,2} - P^*_{150}Y_{150+h,2})^2 = \sigma_Z^2\sum_{j=0}^{h-1}\alpha_j^{*2} + \sigma_W^2\sum_{j=0}^{h-1}\beta_j^{*2},$$

where

$$\sum_{j=0}^{\infty}\alpha_j^*z^j = 4.717z^3(1 - 0.474z)(1 - 0.724z)^{-1}(1 - z)^{-1}$$

and

$$\sum_{j=0}^{\infty}\beta_j^*z^j = (1 - 0.582z)(1 - z)^{-1}.$$

For $h = 1$ and 2 we obtain

$$E(Y_{151,2} - P^*_{150}Y_{151,2})^2 = 0.0486,$$

$$E(Y_{152,2} - P^*_{150}Y_{152,2})^2 = 0.0570,$$

in close agreement with the finite-past mean squared errors obtained by ITSM.

It is interesting to examine the improvement obtained by using the transfer function model rather than fitting a univariate model to the sales data alone. If we adopt the latter course, we obtain the model

$$X_{t2} - 0.249X_{t-1,2} - 0.199X_{t-2,2} = U_t,$$

where $\{U_t\} \sim \mathrm{WN}(0, 1.794)$ and $X_{t2} = Y_{t2} - Y_{t-1,2} - 0.420$. The corresponding predictors of $Y_{151,2}$ and $Y_{152,2}$ are easily found from the program ITSM to be 263.14 and 263.58 with mean squared errors 1.794 and 4.593, respectively. These mean squared errors are much larger than those obtained using the transfer function model.
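
The arithmetic of this example can be retraced in a few lines. The sketch below plugs the quoted values into the recursion (11.1.15) for the fitted model (11.1.8) and expands the two generating functions to obtain the mean squared errors for $h = 1, 2$. The helper `series_ratio` is a hypothetical name introduced here for power-series division; the computation uses the rounded coefficients printed in the text, so the last digit may differ slightly from the ITSM output.

```python
import numpy as np

def series_ratio(num, den, m):
    """First m power-series coefficients of num(z)/den(z); requires den[0] == 1."""
    num = np.concatenate([np.asarray(num, dtype=float), np.zeros(m)])[:m]
    out = np.zeros(m)
    for j in range(m):
        acc = num[j]
        for i in range(1, min(j, len(den) - 1) + 1):
            acc -= den[i] * out[j - i]
        out[j] = acc
    return out

# Model (11.1.8) written in the form (11.1.14) with delay b = 3:
# A(B) = 1 - 0.724B, U(B) = 4.717, V(B) = (1 - 0.724B)(1 - 0.582B)
A1, U0 = 0.724, 4.717
V = np.convolve([1.0, -0.724], [1.0, -0.582])    # [1, V_1, V_2]

# Values quoted in the example
x1_148, x1_149 = -0.093, 0.237
x2_150 = 0.08
w_150, w_149 = -0.0706, 0.1449

# Recursion (11.1.15) for h = 1 and h = 2
p151 = A1 * x2_150 + U0 * x1_148 + V[1] * w_150 + V[2] * w_149
p152 = A1 * p151 + U0 * x1_149 + V[2] * w_150

# Undo the differencing and the mean correction 0.420
y150 = 262.7
y151 = y150 + p151 + 0.420
y152 = y151 + p152 + 0.420

# Mean squared errors from the coefficients alpha*_j and beta*_j
sigma_z2, sigma_w2 = 0.0779, 0.0486
h_max = 2
alpha = series_ratio(np.convolve([0, 0, 0, 4.717], [1, -0.474]),
                     np.convolve([1, -0.724], [1, -1.0]), h_max)
beta = series_ratio([1, -0.582], [1, -1.0], h_max)
mse = [sigma_z2 * np.sum(alpha[:h] ** 2) + sigma_w2 * np.sum(beta[:h] ** 2)
       for h in range(1, h_max + 1)]

print(round(p151, 3), round(p152, 3), round(y151, 2), round(y152, 2))
print([round(m, 4) for m in mse])
```

Because the delay is 3, the coefficients $\alpha_0^*$ and $\alpha_1^*$ vanish, so for $h \le 3$ only the noise term contributes to the mean squared error of the sales forecasts.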

□


11.2  Intervention Analysis

During the period for which a time series is observed, it is sometimes the case that a change occurs that affects the level of the series. A change in the tax laws may, for example, have a continuing effect on the daily closing prices of shares on the stock market. In the same way construction of a dam on a river may have a dramatic effect on the time series of streamflows below the dam. In the following we shall assume that the time $T$ at which the change (or “intervention”) occurs is known.

To account for such changes, Box and Tiao (1975) introduced a model for intervention analysis that has the same form as the transfer function model

$$Y_t = \sum_{j=0}^{\infty}\tau_jX_{t-j} + N_t, \qquad (11.2.1)$$

except that the input series $\{X_t\}$ is not a random series but a deterministic function of $t$. It is clear from (11.2.1) that $\sum_{j=0}^{\infty}\tau_jX_{t-j}$ is then the mean of $Y_t$. The function $\{X_t\}$ and the coefficients $\{\tau_j\}$ are therefore chosen in such a way that the changing level of the observations of $\{Y_t\}$ is well represented by the sequence $\sum_{j=0}^{\infty}\tau_jX_{t-j}$. For a series $\{Y_t\}$ with $EY_t = 0$ for $t < T$ and $EY_t \to 0$ as $t \to \infty$, a suitable input series is

$$X_t = I_t(T) = \begin{cases} 1 & \text{if } t = T, \\ 0 & \text{if } t \ne T. \end{cases} \qquad (11.2.2)$$

For a series $\{Y_t\}$ with $EY_t = 0$ for $t < T$ and $EY_t \to a \ne 0$ as $t \to \infty$, a suitable input series is

$$X_t = H_t(T) = \sum_{k=T}^{\infty}I_t(k) = \begin{cases} 1 & \text{if } t \ge T, \\ 0 & \text{if } t < T. \end{cases} \qquad (11.2.3)$$

(Other deterministic input functions $\{X_t\}$ can also be used, for example when interventions occur at more than one time.) The function $\{X_t\}$ having been selected by inspection of the data, the determination of the coefficients $\{\tau_j\}$ in (11.2.1) then reduces to a regression problem in which the errors $\{N_t\}$ constitute an ARMA process. This problem can be solved using the program ITSM as described below.

The goal of intervention analysis is to estimate the effect of the intervention as indicated by the term $\sum_{j=0}^{\infty}\tau_jX_{t-j}$ and to use the resulting model (11.2.1) for forecasting. For example, Wichern and Jones (1978) used intervention analysis to investigate the effect of the American Dental Association's endorsement of Crest toothpaste on Crest's market share. Other applications of intervention analysis can be found in Box and Tiao (1975), Atkins (1979), and Bhattacharyya and Layton (1979). A more general approach can also be found in West and Harrison (1989), Harvey (1990), and Pole et al. (1994).

As in the case of transfer function modeling, once $\{X_t\}$ has been chosen (usually as either (11.2.2) or (11.2.3)), estimation of the linear filter $\{\tau_j\}$ in (11.2.1) is simplified by approximating the operator $T(B) = \sum_{j=0}^{\infty}\tau_jB^j$ with a rational operator of the form

$$T(B) = \frac{B^bW(B)}{V(B)}, \qquad (11.2.4)$$

where $b$ is the delay parameter and $W(B)$ and $V(B)$ are polynomials of the form

$$W(B) = w_0 + w_1B + \cdots + w_qB^q$$

and

$$V(B) = 1 - v_1B - \cdots - v_pB^p.$$

By suitable choice of the parameters $b$, $q$, $p$ and the coefficients $w_i$ and $v_j$, the intervention term $T(B)X_t$ can be made to take a great variety of functional forms.

For example, if $T(B) = wB^2/(1 - vB)$ and $X_t = I_t(T)$ as in (11.2.2), the resulting intervention term is

$$\frac{wB^2}{(1 - vB)}I_t(T) = \sum_{j=0}^{\infty}v^jwI_{t-j-2}(T) = \sum_{j=0}^{\infty}v^jwI_t(T + 2 + j),$$

a series of pulses of sizes $v^jw$ at times $T + 2 + j$, $j = 0, 1, 2, \ldots$. If $|v| < 1$, the effect of the intervention is to add a series of pulses with size $w$ at time $T + 2$, decreasing to zero at a geometric rate depending on $v$ as $t \to \infty$. Similarly, with $X_t = H_t(T)$ as in (11.2.3),

$$\frac{wB^2}{(1 - vB)}H_t(T) = \sum_{j=0}^{\infty}v^jwH_{t-j-2}(T) = \sum_{j=0}^{\infty}(1 + v + \cdots + v^j)wI_t(T + 2 + j),$$

a series of pulses of sizes $(1 + v + \cdots + v^j)w$ at times $T + 2 + j$, $j = 0, 1, 2, \ldots$. If $|v| < 1$, the effect of the intervention is to bring about a shift in level of the series $\{Y_t\}$, the size of the shift converging to $w/(1 - v)$ as $t \to \infty$.
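
These two intervention terms are easy to reproduce numerically. The following is a minimal sketch, assuming hypothetical values $T = 10$, $w = 2$, $v = 0.6$ and a series of length 60; the function `rational_filter_response` is an illustrative helper that applies $wB^b/(1 - vB)$ by recursion with zero start-up values.

```python
import numpy as np

def rational_filter_response(x, w, v, b):
    """Apply T(B) = w B^b / (1 - v B) to x, assuming zero values before the start."""
    out = np.zeros(len(x))
    for t in range(len(x)):
        delayed = x[t - b] if t >= b else 0.0
        previous = out[t - 1] if t > 0 else 0.0
        out[t] = v * previous + w * delayed
    return out

T, n, w, v = 10, 60, 2.0, 0.6          # hypothetical intervention time and parameters
t = np.arange(1, n + 1)                # times 1..n; time t is stored at index t - 1
pulse = (t == T).astype(float)         # I_t(T) as in (11.2.2)
step = (t >= T).astype(float)          # H_t(T) as in (11.2.3)

pulse_resp = rational_filter_response(pulse, w, v, b=2)
step_resp = rational_filter_response(step, w, v, b=2)

print(pulse_resp[T + 1:T + 5])         # pulses w, vw, v^2 w, ... from time T + 2
print(step_resp[-1], w / (1 - v))      # level shift approaches w / (1 - v)
```

The pulse input produces a geometrically decaying transient, whereas the step input produces a permanent change of level, which is the distinction used when choosing between (11.2.2) and (11.2.3).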

An appropriate form for $X_t$ and possible values of $b$, $q$, and $p$ having been chosen by inspection of the data, the estimation of the parameters in (11.2.4) and the fitting of the model for $\{N_t\}$ can be carried out using steps (6)-(8) of the transfer function modeling procedure described in Section 11.1. Start with step (7) and assume that $\{N_t\}$ is white noise to get preliminary estimates of the coefficients $w_i$ and $v_j$ by least squares. The residuals are filed and used as estimates of $\{N_t\}$. Then go to step (6) and continue exactly as for transfer function modeling with input series $\{X_t\}$ and output series $\{Y_t\}$.

Figure 11-3. The differenced series of Example 11.2.1 (showing also the fitted intervention term accounting for the seat-belt legislation of 1983).

Example 11.2.1  Seat-Belt Legislation

In this example we reanalyze the seat-belt legislation data, SBL.TSM of Example 6.6.3, from the point of view of intervention analysis. For this purpose the bivariate series $\{(f_t, Y_t)\}$ consisting of the series filed as SBLIN.TSM and SBL.TSM respectively has been saved in the file SBL2.TSM. The input series $\{f_t\}$ is the deterministic step-function defined in Example 6.6.3 and $Y_t$ is the number of deaths and serious injuries on UK roads in month $t$, $t = 1, \ldots, 120$, corresponding to the 10 years beginning with January 1975.

To account for the seat-belt legislation, we use the same model (6.6.15) as in Example 6.6.3 and, because of the apparent non-stationarity of the residuals, we again difference both $\{f_t\}$ and $\{Y_t\}$ at lag 12 to obtain the model (6.6.15), i.e.,

$$X_t = bg_t + N_t, \qquad (11.2.5)$$

where $X_t = \nabla_{12}Y_t$, $g_t = \nabla_{12}f_t$, and $\{N_t\}$ is a zero-mean stationary time series. This is a particularly simple example of the general intervention model (11.2.1) for the series $\{X_t\}$ with intervention $\{bg_t\}$. Our aim is to find a suitable model for $\{N_t\}$ and at the same time to estimate $b$, taking into account the autocorrelation function of the model for $\{N_t\}$. To apply intervention analysis to this problem using ITSM, we proceed as follows:

(1) Open the bivariate project SBL2.TSM and difference the series at lag 12.

(2) Select Transfer>Specify model and you will see that the default input and noise are white noise, while the default transfer model relating the input $g_t$ to the output $X_t$ is $X_t = bg_t$ with $b = 1$. Click OK, leaving these settings as they are. The input model is irrelevant for intervention analysis and estimation of the transfer function with the default noise model will give us the ordinary least squares estimate of $b$ in the model $X_t = bg_t + N_t$, with the residuals providing estimates of $N_t$. Now select Transfer>Estimation and click OK. You will then see the estimated value $-346.9$ for $b$. Finally, press the red Export button (top right in the ITSM window) to export the residuals (estimated values of $N_t$) to a file and call it, say, NOISE.TSM.

(3) Without closing the bivariate project, open the univariate project NOISE.TSM. The sample ACF and PACF of the series suggest either an MA(13) or AR(13) model. Fitting AR and MA models of order up to 13 (with no mean-correction) using the option Model>Estimation>Autofit gives an MA(12) model as the minimum AICC fit.

(4) Return to the bivariate project by highlighting the window labeled SBL2.TSM and select Transfer>Specify model. The transfer model will now show the estimated value $-346.9$ for $b$. Click on the Residual Model tab, enter 12 for the MA order and click OK. Select Transfer>Estimation and again click OK. The parameters in both the noise and transfer models will then be estimated and printed on the screen. Repeating the minimization with decreasing step-sizes, 0.1, 0.01 and then 0.001, gives the model

$$X_t = -362.5g_t + N_t,$$

where $N_t = W_t + 0.207W_{t-1} + 0.311W_{t-2} + 0.105W_{t-3} + 0.040W_{t-4} + 0.194W_{t-5} + 0.100W_{t-6} + 0.299W_{t-7} + 0.080W_{t-8} + 0.125W_{t-9} + 0.210W_{t-10} + 0.109W_{t-11} + 0.501W_{t-12}$, and $\{W_t\} \sim \mathrm{WN}(0, 17289)$. File the residuals (which are now estimates of $\{W_t\}$) as RES.TSM. The differenced series $\{X_t\}$ and the fitted intervention term, $-362.5g_t$, are shown in Figure 11-3.

Figure 11-4. The sample ACF of the residuals from the model in Example 11.2.1.

(5) Open the univariate project RES.TSM and apply the usual tests for randomness by selecting Statistics>Residual Analysis. The tests are all passed at level 0.05, leading us to conclude that the model found in step (4) is satisfactory. The sample ACF of the residuals is shown in Figure 11-4.
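
The ordinary least squares stage in step (2) can be sketched outside ITSM. The example below uses simulated data only: the step time, level, seasonal pattern, noise scale, and the value of `b_true` are hypothetical stand-ins and are not the SBL.TSM series. It differences a step input and an output at lag 12 and estimates $b$ in $X_t = bg_t + N_t$ by regression through the origin. The refinement with an MA(12) noise model in steps (3) and (4) requires a numerical optimizer and is not shown.

```python
import numpy as np

rng = np.random.default_rng(2)
n, t_star, b_true = 120, 99, -350.0             # hypothetical length, step time, effect
t = np.arange(1, n + 1)
f = (t >= t_star).astype(float)                 # step input f_t
seasonal = 200.0 * np.cos(2 * np.pi * t / 12)   # period-12 component (assumed)
y = 1600.0 + seasonal + b_true * f + rng.normal(0.0, 100.0, n)

# Difference both series at lag 12: X_t = Y_t - Y_{t-12}, g_t = f_t - f_{t-12}
x = y[12:] - y[:-12]
g = f[12:] - f[:-12]

# OLS estimate of b in X_t = b g_t + N_t (no intercept)
b_hat = float(np.dot(g, x) / np.dot(g, g))
resid = x - b_hat * g                           # estimates of N_t
print(round(b_hat, 1))
```

Lag-12 differencing removes both the constant level and the period-12 component, and turns the step into a block of twelve ones, so the OLS estimate is simply the average of the differenced output over those twelve months.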


□ 


11.3  Nonlinear  Models 

A time series of the form

$$X_t = \sum_{j=0}^{\infty}\psi_jZ_{t-j}, \quad \{Z_t\} \sim \mathrm{IID}(0, \sigma^2), \qquad (11.3.1)$$

where $Z_t$ is expressible as a mean square limit of linear combinations of $\{X_s, -\infty < s \le t\}$, has the property that the best mean square predictor $E(X_{t+h}|X_s, -\infty < s \le t)$ and the best linear predictor $P_tX_{t+h}$ in terms of $\{X_s, -\infty < s \le t\}$ are identical. It can be shown that if IID is replaced by WN in (11.3.1), then the two predictors are identical if and only if $\{Z_t\}$ is a martingale difference sequence relative to $\{X_t\}$, i.e., if and only if $E(Z_t|X_s, -\infty < s < t) = 0$ for all $t$.

The Wold decomposition (Section 2.6) ensures that every purely nondeterministic stationary process can be expressed in the form (11.3.1) with $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$. The process $\{Z_t\}$ in the Wold decomposition, however, is generally not an iid sequence, and the best mean square predictor of $X_{t+h}$ may be quite different from the best linear predictor.

In the case where $\{X_t\}$ is a purely nondeterministic Gaussian stationary process, the sequence $\{Z_t\}$ in the Wold decomposition is Gaussian and therefore iid. Every stationary purely nondeterministic Gaussian process can therefore be generated by applying a causal linear filter to an iid Gaussian sequence. We shall therefore refer to such a process as a Gaussian linear process.

In this section we shall use the term linear process to mean a process $\{X_t\}$ of the form (11.3.1). This is a more restrictive use of the term than in Definition 2.2.1.


11.3.1  Deviations from Linearity

Many  of  the  time  series  encountered  in  practice  exhibit  characteristics  not  shown  by 
linear  processes,  and  so  to  obtain  good  models  and  predictors  it  is  necessary  to  look  to 
models  more  general  than  those  satisfying  (11.3.1)  with  iid  noise.  As  indicated  above, 
this  will  mean  that  the  minimum  mean  squared  error  predictors  are  not,  in  general, 
linear  functions  of  the  past  observations. 

Gaussian linear processes have a number of properties that are often found to be violated by observed time series. The former are reversible in the sense that $(X_{t_1}, \ldots, X_{t_n})'$ has the same distribution as $(X_{t_n}, \ldots, X_{t_1})'$. (Except in a few special cases, ARMA processes are reversible if and only if they are Gaussian (Breidt and Davis 1992).) Deviations from this property by observed time series are suggested by sample paths that rise to their maxima and fall away at different rates (see, for example, the sunspot numbers filed as SUNSPOTS.TSM). Bursts of outlying values are frequently observed in practical time series and are seen also in the sample paths of nonlinear (and infinite-variance) models. They are rarely seen, however, in the sample paths of Gaussian linear processes. Other characteristics suggesting deviation from a Gaussian linear model are discussed by Tong (1990).

Many observed time series, particularly financial time series, exhibit periods during which they are “less predictable” (or “more volatile”), depending on the past history of the series. This dependence of the predictability (i.e., the size of the prediction mean squared error) on the past of the series cannot be modeled with a linear time series, since for a linear process the minimum $h$-step mean squared error is independent of the past history. Linear models thus fail to take account of the possibility that certain past histories may permit more accurate forecasting than others, and cannot identify the circumstances under which more accurate forecasts can be expected. Nonlinear models, on the other hand, do allow for this. The ARCH and GARCH models considered in Section 7.2 are in fact constructed around the dependence of the conditional variance of the process on its past history.
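
The dependence of predictability on the past can be seen in a small simulation. The sketch below assumes an ARCH(1) recursion with Gaussian innovations, in which the conditional variance is $a_0 + a_1Z_{t-1}^2$; the parameter values, sample size, and the split at the median are illustrative choices. Since the best predictor of $Z_t$ is zero, the one-step squared prediction error is $Z_t^2$, and its average is compared after "calm" and after "volatile" previous values.

```python
import numpy as np

rng = np.random.default_rng(3)
n, a0, a1 = 20000, 1.0, 0.5            # hypothetical ARCH(1) parameters
e = rng.normal(size=n)                 # iid N(0, 1) innovations
z = np.zeros(n)
for t in range(1, n):
    h_t = a0 + a1 * z[t - 1] ** 2      # conditional variance given the past
    z[t] = np.sqrt(h_t) * e[t]

prev_sq, next_sq = z[:-1] ** 2, z[1:] ** 2
calm = prev_sq <= np.median(prev_sq)   # previous value small in magnitude
mse_after_calm = next_sq[calm].mean()
mse_after_burst = next_sq[~calm].mean()
print(round(mse_after_calm, 2), round(mse_after_burst, 2))
```

For a linear process driven by iid noise the two averages would agree up to sampling error; here the squared error is markedly larger after a volatile value.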


11.3.2  Chaotic Deterministic Sequences

To distinguish between linear and nonlinear processes, we need to be able to decide in particular when a white noise sequence is also iid. Sequences generated by nonlinear deterministic difference equations can exhibit sample correlation functions that are very close to those of samples from a white noise sequence. However, the deterministic nature of the recursions implies the strongest possible dependence between successive observations. For example, the celebrated logistic equation (see May 1976 and Tong 1990) defines a sequence $\{x_n\}$, for any given $x_0$, via the equations

$$x_n = 4x_{n-1}(1 - x_{n-1}), \quad 0 < x_0 < 1.$$

The  values  of  xn  are,  for  even  moderately  large  values  of  n ,  extremely  sensitive 
to  small  changes  in  x$.  This  is  clear  from  the  fact  that  the  sequence  can  be  expressed 
explicitly  as 

xn  —  sin2  (2warcsin  (v^o))  >  n  —  0,  1,  2, ... . 

A very small change $\delta$ in $\arcsin(\sqrt{x_0})$ leads to a change $2^n\delta$ in the argument of the sine
function defining $x_n$. If we generate a sequence numerically, the generated sequence
will, for most values of $x_0$ in the interval (0,1), be random in appearance, with a
sample autocorrelation function similar to that of a sample from white noise. The
data file CHAOS.TSM contains the sequence $x_1, \ldots, x_{200}$ (correct to nine decimal
places) generated by the logistic equation with $x_0 = \pi/10$. The calculation requires
specification of $x_0$ to at least 70 decimal places and the use of correspondingly high
precision arithmetic. The series and its sample autocorrelation function are shown in
Figures 11-5 and 11-6. The sample ACF and the AICC criterion both suggest white
noise with mean 0.4954 as a model for the series. Under this model the best linear
predictor of $X_{201}$ would be 0.4954. However, the best predictor of $X_{201}$ to nine decimal
places is, in fact, $4x_{200}(1 - x_{200}) = 0.016286669$, with zero mean squared error.

Figure 11-5. A sequence generated by the recursions $x_n = 4x_{n-1}(1 - x_{n-1})$.

Figure 11-6. The sample autocorrelation function of the sequence in Figure 11-5 (horizontal axis: lag).

Distinguishing  between  iid  and  non-iid  white  noise  is  clearly  not  possible  on  the 
basis  of  second-order  properties.  For  insight  into  the  dependence  structure  we  can 
examine  sample  moments  of  order  higher  than  two.  For  example,  the  dependence  in  the 
data in CHAOS.TSM is reflected by a significantly nonzero sample autocorrelation at
lag  1  of  the  squared  data.  In  the  following  paragraphs  we  consider  several  approaches 
to  this  problem. 
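The behaviour described in this subsection can be reproduced with a short sketch. It is not the calculation behind CHAOS.TSM: the starting value $x_0 = 0.3$, the length 2000, and the 700-digit working precision are assumptions made here for illustration. Because each iteration roughly doubles any rounding error, about $n \log_{10} 2$ digits are lost over $n$ steps, which is why the recursion is carried out in high-precision decimal arithmetic.

```python
from decimal import Decimal, getcontext


def logistic_orbit(x0, n, digits=700):
    """Return x_1, ..., x_n of x_k = 4 x_{k-1} (1 - x_{k-1}).

    The recursion runs in `digits`-digit decimal arithmetic; rounding errors
    roughly double at every step, so `digits` should comfortably exceed
    n * log10(2).
    """
    getcontext().prec = digits
    x = Decimal(x0)
    orbit = []
    for _ in range(n):
        x = 4 * x * (1 - x)
        orbit.append(float(x))  # keep only double precision for the statistics
    return orbit


def sample_acf(series, lag):
    """Sample autocorrelation at a single lag (divisor n, as in the text)."""
    n = len(series)
    mean = sum(series) / n
    c0 = sum((v - mean) ** 2 for v in series) / n
    ch = sum((series[t] - mean) * (series[t + lag] - mean)
             for t in range(n - lag)) / n
    return ch / c0


# Illustrative starting value (an assumption; the text uses x_0 = pi/10).
x = logistic_orbit("0.3", 2000)
acf_x = sample_acf(x, 1)                      # expected to look like white noise
acf_sq = sample_acf([v * v for v in x], 1)    # squares expose the dependence
print("lag-1 ACF of x_n:   ", round(acf_x, 3))
print("lag-1 ACF of x_n^2: ", round(acf_sq, 3))
```

With these settings the lag-one sample autocorrelation of the series itself should be close to zero, while that of the squared series should be clearly nonzero, in line with the remark about CHAOS.TSM.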


11.3.3 Distinguishing Between White Noise and iid Sequences

If $\{X_t\} \sim \mathrm{WN}(0, \sigma^2)$ and $E|X_t|^4 < \infty$, a useful tool for deciding whether or not
$\{X_t\}$ is iid is the ACF $\rho_{X^2}(h)$ of the process $\{X_t^2\}$. If $\{X_t\}$ is iid, then $\rho_{X^2}(h) = 0$ for
all $h \ne 0$, whereas this is not necessarily the case otherwise. This is the basis for the
test of McLeod and Li described in Section 1.6.
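As a rough illustration of this idea, the sketch below computes the McLeod-Li statistic in its usual portmanteau form, $Q = n(n+2)\sum_{k=1}^{h} \hat\rho_{X^2}^2(k)/(n-k)$, which is compared with a chi-squared distribution with $h$ degrees of freedom. The simulated series, the ARCH(1) coefficients 0.6 and 0.4, and the choice $h = 20$ are assumptions made for the example.

```python
import random


def sample_acf(series, max_lag):
    """Sample autocorrelations at lags 1, ..., max_lag (divisor n)."""
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]
    c0 = sum(d * d for d in dev) / n
    return [sum(dev[t] * dev[t + k] for t in range(n - k)) / n / c0
            for k in range(1, max_lag + 1)]


def mcleod_li(series, max_lag=20):
    """Portmanteau statistic built from the ACF of the squared series."""
    n = len(series)
    rho = sample_acf([v * v for v in series], max_lag)
    return n * (n + 2) * sum(r * r / (n - k)
                             for k, r in enumerate(rho, start=1))


rng = random.Random(1)
n = 2000

# iid Gaussian noise: white noise that is also iid.
iid = [rng.gauss(0.0, 1.0) for _ in range(n)]

# ARCH(1) noise (hypothetical coefficients): white noise that is not iid.
arch, prev = [], 0.0
for _ in range(n):
    prev = (0.6 + 0.4 * prev * prev) ** 0.5 * rng.gauss(0.0, 1.0)
    arch.append(prev)

q_iid = mcleod_li(iid)
q_arch = mcleod_li(arch)
CHI2_95_DF20 = 31.41  # upper 5% point of chi-squared with 20 degrees of freedom
print("iid :", round(q_iid, 1), " ARCH:", round(q_arch, 1))
```

Both series have approximately uncorrelated values, but only the ARCH series should produce a statistic far above the critical value.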

Now suppose that $\{X_t\}$ is a strictly stationary time series such that $E|X_t|^k \le K < \infty$
for some integer $k \ge 3$. The $k$th-order cumulant $C_k(r_1, \ldots, r_{k-1})$ of $\{X_t\}$ is then
defined as the joint cumulant of the random variables $X_t, X_{t+r_1}, \ldots, X_{t+r_{k-1}}$, i.e., as
the coefficient of $i^k z_1 z_2 \cdots z_k$ in the Taylor expansion about $(0, \ldots, 0)$ of

$$\chi(z_1, \ldots, z_k) := \ln E\left[\exp\left(i z_1 X_t + i z_2 X_{t+r_1} + \cdots + i z_k X_{t+r_{k-1}}\right)\right]. \tag{11.3.2}$$

(Since $\{X_t\}$ is strictly stationary, this quantity does not depend on $t$.) In particular, the
third-order cumulant function $C_3$ of $\{X_t\}$ coincides with the third-order central moment
function, i.e.,

$$C_3(r, s) = E\left[(X_t - \mu)(X_{t+r} - \mu)(X_{t+s} - \mu)\right], \quad r, s \in \{0, \pm 1, \ldots\},$$


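where $\mu = EX_t$. If $\sum_{r=-\infty}^{\infty}\sum_{s=-\infty}^{\infty} |C_3(r, s)| < \infty$, we define the third-order polyspectral
density (or bispectral density) of $\{X_t\}$ to be the Fourier transform

$$f_3(\omega_1, \omega_2) = \frac{1}{(2\pi)^2} \sum_{r=-\infty}^{\infty} \sum_{s=-\infty}^{\infty} C_3(r, s)\, e^{-ir\omega_1 - is\omega_2}, \quad -\pi \le \omega_1, \omega_2 \le \pi,$$

in which case

$$C_3(r, s) = \int_{-\pi}^{\pi} \int_{-\pi}^{\pi} e^{ir\omega_1 + is\omega_2} f_3(\omega_1, \omega_2)\, d\omega_1\, d\omega_2.$$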


[More generally, if the $k$th order cumulants $C_k(r_1, \ldots, r_{k-1})$ of $\{X_t\}$ are absolutely
summable, we define the $k$th order polyspectral density as the Fourier transform of
$C_k$. For details see Rosenblatt (1985) and Priestley (1988).]

If $\{X_t\}$ is a Gaussian linear process, it follows from Problem 10.3 that the cumulant
function $C_3$ of $\{X_t\}$ is identically zero. (The same is also true of all the cumulant
functions $C_k$ with $k > 3$.) Consequently, $f_3(\omega_1, \omega_2) = 0$ for all $\omega_1, \omega_2 \in [-\pi, \pi]$.
Appropriateness of a Gaussian linear model for a given data set can therefore be
checked by using the data to test the null hypothesis $f_3 = 0$. For details of such a
test, see Subba-Rao and Gabr (1984).
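The third-order central moment function $C_3(r, s)$ has a natural sample analogue, sketched below. This is only the raw moment estimate, not the bispectral test of Subba-Rao and Gabr; the function name `sample_c3`, the divisor $n$, and the two simulated series are assumptions made for illustration.

```python
import random


def sample_c3(x, r, s):
    """Sample third-order central moment at lags (r, s).

    Computes (1/n) * sum_t (x_t - m)(x_{t+r} - m)(x_{t+s} - m), where m is the
    sample mean and the sum runs over all t for which every index is in range.
    """
    n = len(x)
    m = sum(x) / n
    dev = [v - m for v in x]
    start = max(0, -r, -s)
    stop = n - max(0, r, s)
    return sum(dev[t] * dev[t + r] * dev[t + s]
               for t in range(start, stop)) / n


rng = random.Random(0)
gauss = [rng.gauss(0.0, 1.0) for _ in range(5000)]      # C_3 is identically 0
skewed = [rng.expovariate(1.0) for _ in range(5000)]    # C_3(0, 0) equals 2

print("Gaussian    C3(0,0):", round(sample_c3(gauss, 0, 0), 3))
print("Exponential C3(0,0):", round(sample_c3(skewed, 0, 0), 3))
```

For the Gaussian sample all estimates should be near zero, whereas the iid exponential sample, which is non-Gaussian, has third central moment 2 at lag (0, 0).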

If $\{X_t\}$ is a linear process of the form (11.3.1) with $E|Z_t|^3 < \infty$, $EZ_t^3 = \eta$, and
$\sum_{j=0}^{\infty} |\psi_j| < \infty$, it can be shown from (11.3.2) (see Problem 11.3) that the third-order
cumulant function of $\{X_t\}$ is given by

$$C_3(r, s) = \eta \sum_{i=-\infty}^{\infty} \psi_i \psi_{i+r} \psi_{i+s} \tag{11.3.3}$$


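(with $\psi_j = 0$ for $j < 0$), and hence that $\{X_t\}$ has bispectral density

$$f_3(\omega_1, \omega_2) = \frac{\eta}{4\pi^2}\, \psi\left(e^{i(\omega_1 + \omega_2)}\right) \psi\left(e^{-i\omega_1}\right) \psi\left(e^{-i\omega_2}\right), \tag{11.3.4}$$

where $\psi(z) := \sum_{j=0}^{\infty} \psi_j z^j$. By Proposition 4.3.1, the spectral density of $\{X_t\}$ is

$$f(\omega) = \frac{\sigma^2}{2\pi} \left|\psi\left(e^{-i\omega}\right)\right|^2.$$

Hence,

$$\phi(\omega_1, \omega_2) := \frac{|f_3(\omega_1, \omega_2)|^2}{f(\omega_1) f(\omega_2) f(\omega_1 + \omega_2)} = \frac{\eta^2}{2\pi\sigma^6}.$$

Appropriateness of the linear process (11.3.1) for modeling a given data set can
therefore be checked by using the data to test for constancy of $\phi(\omega_1, \omega_2)$ (Subba-Rao
and Gabr 1984).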


11.3.4 Three Useful Classes of Nonlinear Models


If it is decided that a linear Gaussian model is not appropriate, there is a choice of several
families of nonlinear processes that have been found useful for modeling purposes.
These  include  bilinear  models,  autoregressive  models  with  random  coefficients,  and 
threshold  models.  Excellent  accounts  of  these  are  available  in  Subba-Rao  and  Gabr 
(1984),  Nicholls  and  Quinn  (1982),  and  Tong  (1990),  respectively. 

The bilinear model of order $(p, q, r, s)$ is defined by the equations

$$X_t = Z_t + \sum_{i=1}^{p} a_i X_{t-i} + \sum_{j=1}^{q} b_j Z_{t-j} + \sum_{i=1}^{r} \sum_{j=1}^{s} c_{ij} X_{t-i} Z_{t-j}, \tag{11.3.5}$$

where $\{Z_t\} \sim \mathrm{iid}(0, \sigma^2)$. A sufficient condition for the existence of a strictly stationary
solution of these equations is given by Liu and Brockwell (1988).

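A random coefficient autoregressive process $\{X_t\}$ of order $p$ satisfies an equation
of the form

$$X_t = \sum_{i=1}^{p} \left(\phi_i + U_t^{(i)}\right) X_{t-i} + Z_t,$$

where $\{Z_t\} \sim \mathrm{IID}(0, \sigma^2)$, $\{U_t^{(i)}\} \sim \mathrm{IID}(0, \nu_i^2)$, $\{Z_t\}$ is independent of $\{U_t^{(i)}\}$, and
$\phi_1, \ldots, \phi_p \in \mathbb{R}$.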

Threshold models can be regarded as piecewise linear models in which the linear
relationship varies with the values of the process. For example, if $R^{(i)}$, $i = 1, \ldots, k$, is
a partition of $\mathbb{R}^p$, and $\{Z_t\} \sim \mathrm{IID}(0, 1)$, then the $k$ difference equations

$$X_t = \sigma^{(i)} Z_t + \sum_{j=1}^{p} \phi_j^{(i)} X_{t-j}, \quad (X_{t-1}, \ldots, X_{t-p}) \in R^{(i)}, \quad i = 1, \ldots, k, \tag{11.3.6}$$

define a threshold AR($p$) model. Model identification and parameter estimation for
threshold models can be carried out in a manner similar to that for linear models using
maximum likelihood and the AIC criterion.
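To make the piecewise linear structure of (11.3.6) concrete, here is a minimal simulation sketch of a threshold AR(1) model with $k = 2$ regimes. The partition $R^{(1)} = (-\infty, 0]$, $R^{(2)} = (0, \infty)$ and the regime parameters are hypothetical values chosen for illustration, not taken from the text.

```python
import random


def tar_step(x_prev, z, threshold=0.0, low=(0.5, 1.0), high=(-0.5, 2.0)):
    """One step of a two-regime threshold AR(1) model.

    `low` and `high` are (phi, sigma) pairs used when x_prev <= threshold and
    x_prev > threshold respectively (hypothetical illustrative values).
    """
    phi, sigma = low if x_prev <= threshold else high
    return phi * x_prev + sigma * z


def simulate_tar(n, seed=0, burn_in=200):
    """Simulate n values after discarding a burn-in period."""
    rng = random.Random(seed)
    x, path = 0.0, []
    for t in range(n + burn_in):
        x = tar_step(x, rng.gauss(0.0, 1.0))
        if t >= burn_in:
            path.append(x)
    return path


path = simulate_tar(500)
print(len(path), "simulated values; first three:", [round(v, 3) for v in path[:3]])
```

The regime is selected from the previous value alone because $p = 1$; for larger $p$ the test `x_prev <= threshold` would be replaced by membership of $(X_{t-1}, \ldots, X_{t-p})$ in a region of $\mathbb{R}^p$.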


11.4 Long-Memory Models

The autocorrelation function $\rho(\cdot)$ of an ARMA process at lag $h$ converges rapidly to
zero as $h \to \infty$ in the sense that there exists $r > 1$ such that

$$r^h \rho(h) \to 0 \quad \text{as } h \to \infty. \tag{11.4.1}$$
Stationary processes with much more slowly decreasing autocorrelation function,
known as fractionally integrated ARMA processes, or more precisely as ARIMA
($p$, $d$, $q$) processes with $0 < |d| < 0.5$, satisfy difference equations of the form

$$(1 - B)^d \phi(B) X_t = \theta(B) Z_t, \tag{11.4.2}$$

where $\phi(z)$ and $\theta(z)$ are polynomials of degrees $p$ and $q$, respectively, satisfying

$$\phi(z) \ne 0 \quad \text{and} \quad \theta(z) \ne 0 \quad \text{for all } z \text{ such that } |z| \le 1,$$

$B$ is the backward shift operator, and $\{Z_t\}$ is a white noise sequence with mean 0 and
variance $\sigma^2$. The operator $(1 - B)^d$ is defined by the binomial expansion

$$(1 - B)^d = \sum_{j=0}^{\infty} \pi_j B^j,$$

where $\pi_0 = 1$ and

$$\pi_j = \prod_{0 < k \le j} \frac{k - 1 - d}{k}, \quad j = 1, 2, \ldots.$$
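The product formula for the coefficients $\pi_j$ translates directly into a one-line recursion, $\pi_j = \pi_{j-1}(j - 1 - d)/j$. The sketch below is a minimal illustration; the function name and the truncation length are assumptions.

```python
def frac_diff_coeffs(d, n_terms):
    """First n_terms coefficients pi_0, pi_1, ... of (1 - B)^d."""
    coeffs = [1.0]
    for j in range(1, n_terms):
        # pi_j = pi_{j-1} * (j - 1 - d) / j, from the product over 0 < k <= j
        coeffs.append(coeffs[-1] * (j - 1 - d) / j)
    return coeffs


print([round(c, 4) for c in frac_diff_coeffs(0.4, 5)])
print(frac_diff_coeffs(1.0, 4))  # ordinary differencing: 1 - B
```

For $d = 1$ the expansion terminates after two terms and reduces to ordinary differencing, whereas for fractional $d$ the coefficients decay slowly and never vanish.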


The autocorrelation $\rho(h)$ at lag $h$ of an ARIMA($p$, $d$, $q$) process with $0 < |d| < 0.5$
has the property

$$\rho(h)\, h^{1-2d} \to c \ne 0 \quad \text{as } h \to \infty. \tag{11.4.3}$$

This implies (see (11.4.1)) that $\rho(h)$ converges to zero as $h \to \infty$ at a much slower
rate than $\rho(h)$ for an ARMA process. Consequently, fractionally integrated ARMA
processes are said to have "long memory." In contrast, stationary processes whose ACF
converges to 0 rapidly, such as ARMA processes, are said to have "short memory."

A fractionally integrated ARIMA($p$, $d$, $q$) process can be regarded as an ARMA
($p$, $q$) process driven by fractionally integrated noise; i.e., we can replace equation
(11.4.2) by the two equations

$$\phi(B) X_t = \theta(B) W_t \tag{11.4.4}$$

and

$$(1 - B)^d W_t = Z_t. \tag{11.4.5}$$


The process $\{W_t\}$ is called fractionally integrated white noise and can be shown (see,
e.g., Brockwell and Davis (1991), Section 13.2) to have variance and autocorrelations
given by

$$\gamma_W(0) = \sigma^2\, \frac{\Gamma(1 - 2d)}{\Gamma^2(1 - d)} \tag{11.4.6}$$

and

$$\rho_W(h) = \frac{\Gamma(h + d)\,\Gamma(1 - d)}{\Gamma(h - d + 1)\,\Gamma(d)} = \prod_{0 < k \le h} \frac{k - 1 + d}{k - d}, \quad h = 1, 2, \ldots, \tag{11.4.7}$$

where $\Gamma(\cdot)$ is the gamma function (see Example (d) of Section A.1). The exact
autocovariance function of the ARIMA($p$, $d$, $q$) process $\{X_t\}$ defined by (11.4.2) can
therefore be expressed, by Proposition 2.2.1, as

$$\gamma_X(h) = \sum_{j=0}^{\infty} \sum_{k=0}^{\infty} \psi_j \psi_k\, \gamma_W(h + j - k), \tag{11.4.8}$$

where $\sum_{j=0}^{\infty} \psi_j z^j = \theta(z)/\phi(z)$, $|z| \le 1$, and $\gamma_W(\cdot)$ is the autocovariance function of
fractionally integrated white noise with parameters $d$ and $\sigma^2$, i.e.,

$$\gamma_W(h) = \gamma_W(0)\,\rho_W(h),$$

with $\gamma_W(0)$ and $\rho_W(h)$ as in (11.4.6) and (11.4.7). The series (11.4.8) converges rapidly
as long as $\phi(z)$ does not have zeros with absolute value close to 1.

Figure 11-7. Annual minimum water levels of the Nile river for the years 622-871.
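Formulas (11.4.6) and (11.4.7) are easy to evaluate numerically. The sketch below uses the product form of (11.4.7), which avoids overflow of the gamma function at large lags, and illustrates the hyperbolic decay in (11.4.3). The function names and the value $d = 0.25$ are assumptions made for the example.

```python
import math


def fi_noise_variance(d, sigma2=1.0):
    """gamma_W(0) from (11.4.6); requires d < 0.5."""
    return sigma2 * math.gamma(1 - 2 * d) / math.gamma(1 - d) ** 2


def fi_noise_acf(d, max_lag):
    """rho_W(0), ..., rho_W(max_lag) via the product form of (11.4.7)."""
    rho = [1.0]
    for k in range(1, max_lag + 1):
        rho.append(rho[-1] * (k - 1 + d) / (k - d))
    return rho


d = 0.25
rho = fi_noise_acf(d, 2000)
print("gamma_W(0) =", round(fi_noise_variance(d), 4))
print("rho_W(1)   =", round(rho[1], 4))          # equals d / (1 - d)
# rho_W(h) * h^(1 - 2d) should be nearly constant for large h, as in (11.4.3).
for h in (500, 1000, 2000):
    print(h, round(rho[h] * h ** (1 - 2 * d), 5))
```

The near-constant values in the last loop show the slow decay that distinguishes long memory from the geometric decay in (11.4.1).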

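The spectral density of $\{X_t\}$ is given by

$$f(\lambda) = \frac{\sigma^2}{2\pi}\, \frac{\left|\theta\left(e^{-i\lambda}\right)\right|^2}{\left|\phi\left(e^{-i\lambda}\right)\right|^2}\, \left|1 - e^{-i\lambda}\right|^{-2d}. \tag{11.4.9}$$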


Calculation of the exact Gaussian likelihood of observations $\{x_1, \ldots, x_n\}$ of a fractionally
integrated ARMA process is very slow and demanding in terms of computer
memory. Instead of estimating the parameters $d$, $\phi_1, \ldots, \phi_p$, $\theta_1, \ldots, \theta_q$, and $\sigma^2$ by
maximizing the exact Gaussian likelihood, it is much simpler to maximize the Whittle
approximation $L_W$, defined by

$$-2\ln(L_W) = n\ln(2\pi) + 2n\ln\sigma + \sigma^{-2} \sum_j \frac{I_n(\omega_j)}{g(\omega_j)} + \sum_j \ln g(\omega_j), \tag{11.4.10}$$

where $I_n$ is the periodogram, $\sigma^2 g/(2\pi)\ (= f)$ is the model spectral density, and $\sum_j$
denotes the sum over all nonzero Fourier frequencies $\omega_j = 2\pi j/n \in (-\pi, \pi]$. The
program ITSM estimates parameters for ARIMA($p$, $d$, $q$) models in this way. It can
also be used to predict and simulate fractionally integrated ARMA series and to
compute the autocovariance function of any specified fractionally integrated ARMA
model.
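The sketch below shows how (11.4.10) might be evaluated and minimized in the simplest case $p = q = 0$, for which $g(\omega) = |1 - e^{-i\omega}|^{-2d}$ by (11.4.9). It is not the ITSM algorithm: the grid search over $d$, the profiling out of $\sigma^2$ (for fixed $d$ the minimizing value is $n^{-1}\sum_j I_n(\omega_j)/g(\omega_j)$), and all function names are assumptions. The frequencies $2\pi j/n$, $j = 1, \ldots, n-1$, are used in place of those in $(-\pi, \pi]$, which gives the same sums because $I_n$ and $g$ are $2\pi$-periodic.

```python
import cmath
import math
import random


def periodogram(x):
    """Pairs (omega_j, I_n(omega_j)) with I_n = |sum_t x_t e^{-it omega}|^2 / n."""
    n = len(x)
    out = []
    for j in range(1, n):
        w = 2 * math.pi * j / n
        s = sum(v * cmath.exp(-1j * w * t) for t, v in enumerate(x, start=1))
        out.append((w, abs(s) ** 2 / n))
    return out


def g_fi(w, d):
    """g(omega) for fractionally integrated noise (p = q = 0)."""
    return abs(1 - cmath.exp(-1j * w)) ** (-2 * d)


def neg2_log_whittle(pgram, d):
    """-2 ln L_W of (11.4.10), with sigma^2 replaced by its minimizer for this d."""
    n = len(pgram) + 1
    ratios = sum(i_w / g_fi(w, d) for w, i_w in pgram)
    sigma2 = ratios / n
    log_g = sum(math.log(g_fi(w, d)) for w, _ in pgram)
    return n * math.log(2 * math.pi) + n * math.log(sigma2) + ratios / sigma2 + log_g


def whittle_d(x, grid_size=99):
    """Grid-search estimate of d over (-0.5, 0.5)."""
    pgram = periodogram(x)
    grid = [-0.49 + 0.98 * k / (grid_size - 1) for k in range(grid_size)]
    return min(grid, key=lambda d: neg2_log_whittle(pgram, d))


rng = random.Random(2)
noise = [rng.gauss(0.0, 1.0) for _ in range(256)]   # true d is 0
d_hat = whittle_d(noise)
print("estimated d for white noise:", round(d_hat, 3))
```

A production implementation would use a fast Fourier transform and a proper optimizer over all of $d$, $\phi_1, \ldots, \phi_p$, $\theta_1, \ldots, \theta_q$; the direct $O(n^2)$ sum is used here only to keep the example self-contained.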


Example 11.4.1. Annual Minimum Water Levels; NILE.TSM

The data file NILE.TSM consists of the annual minimum water levels of the Nile
river as measured at the Roda gauge near Cairo for the years 622-871. These values
are plotted in Figure 11-7 with the corresponding sample autocorrelations shown in
Figure 11-8. The rather slow decay of the sample autocorrelation function suggests
the possibility of a fractionally integrated model for the mean-corrected series $Y_t =
X_t - 1119$.


Figure 11-8. The sample autocorrelation function of the data in Figure 11-7 (horizontal axis: lag).


The ARMA model with minimum (exact) AICC value for the mean-corrected
series $\{Y_t\}$ is found, using Model>Estimation>Autofit, to be

$$Y_t = -0.323Y_{t-1} - 0.060Y_{t-2} + 0.633Y_{t-3} + 0.069Y_{t-4} + 0.248Y_{t-5} + Z_t + 0.702Z_{t-1} + 0.350Z_{t-2} - 0.419Z_{t-3}, \tag{11.4.11}$$

with $\{Z_t\} \sim \mathrm{WN}(0, 5663.6)$ and AICC = 2889.9.

To fit a fractionally integrated ARMA model to this series, select the option
Model>Specify, check the box marked Fractionally integrated
model, and click on OK. Then select Model>Estimation>Autofit, and click
on Start. This estimation procedure is relatively slow so the specified ranges for $p$
and $q$ should be small (the default is from 0 to 2). When models have been fitted for
each value of ($p$, $q$), the fractionally integrated model with the smallest modified AIC
value is found to be

$$(1 - B)^{0.3830}(1 - 0.1694B + 0.9704B^2)Y_t = (1 - 0.1800B + 0.9278B^2)Z_t, \tag{11.4.12}$$

with $\{Z_t\} \sim \mathrm{WN}(0, 5827.4)$ and modified AIC = 2884.94. (The modified AIC statistic
for estimating the parameters of a fractionally integrated ARMA($p$, $q$) process is
defined in terms of the Whittle likelihood $L_W$ as $-2\ln L_W + 2(p + q + 2)$ if $d$ is
estimated, and $-2\ln L_W + 2(p + q + 1)$ otherwise. The Whittle likelihood was defined
in (11.4.10).)

In order to compare the models (11.4.11) and (11.4.12), the modified AIC
value for (11.4.11) is found as follows. After fitting the model as described above,
select Model>Specify, check the box marked Fractionally integrated
model, set $d = 0$ and click on OK. Next choose Model>Estimation>Max
likelihood, check No optimization and click on OK. You will then see the
modified AIC value, 2884.58, displayed in the ML estimates window together
with the value 2866.58 of $-2\ln L_W$.
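As a quick arithmetic check of the two values just quoted, the ARMA(5,3) model has $p = 5$, $q = 3$, and $d$ fixed at 0, so the penalty term is $2(p + q + 1)$:

$$-2\ln L_W + 2(p + q + 1) = 2866.58 + 2(5 + 3 + 1) = 2866.58 + 18 = 2884.58.$$

Applying the same definition in reverse to the fractionally integrated model with $p = q = 2$ and $d$ estimated, the reported modified AIC of 2884.94 would correspond to $-2\ln L_W = 2884.94 - 2(2 + 2 + 2) = 2872.94$. This last figure is inferred here from the definition and is not a value quoted from the ITSM output.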

The ARMA(5,3) model is slightly better in terms of modified AIC than the
fractionally integrated model and its ACF is closer to the sample ACF of the data than
is the ACF of the fractionally integrated model. (The sample and model autocorrelation
functions can be compared by clicking on the third yellow button at the top of the ITSM
window.) The residuals from both models pass all of the ITSM tests for randomness.

Figure 11-9 shows the graph of $\{x_{200}, \ldots, x_{250}\}$ with predictors of the next 20
values obtained from the model (11.4.12) for the mean-corrected series.

□ 


11.5 Continuous-Time ARMA Processes


Time series frequently consist of observations of a continuous-time process $\{Y(t), t \ge
0\}$ or $\{Y(t), t \in \mathbb{R}\}$ at a discrete sequence of observation times. It is then natural, even
though  the  observations  are  made  at  discrete  times,  to  model  the  data  by  fitting  the 
underlying  continuous-time  process. 

Even if there is no underlying continuous-time process, it may still be advantageous
to model the data as observations of a continuous-time process sampled at
discrete  times.  For  example,  the  analysis  of  time  series  data  observed  at  irregularly 
spaced  times  can  be  handled  very  conveniently  by  regarding  the  data  as  sampled  values 
of  a  continuous-time  process  (see  Jones  1980  and  equation  (11.5.6)  below). 

Continuous-time  models  also  provide  a  unifying  framework  for  data  collected 
when  a  time  series  is  observed  at  different  frequencies,  i.e.,  with  different  spacings 
between  the  observation  times.  Instead  of  requiring  different  discrete-time  models  to 
represent  observations  collected  at  different  frequencies,  continuous-time  modelling 
provides  a  single  model  which  can  be  sampled  at  any  frequency  whatsoever. 

When very high-frequency observations are available (as in many financial and
turbulence studies), the relation between the high-frequency sequence and the underlying
continuous-time process is also of interest since the high-frequency observations
provide  a  natural  source  of  information  regarding  the  continuous-time  process  of 
which  the  discrete  observations  are  a  sample. 

Stationarity of a continuous-time process $\{Y(t)\}$ (cf. Definition 1.4.2) means that
$EY(t)$ and $\mathrm{Cov}(Y(t + h), Y(t))$ are defined and independent of $t$ for all $h \ge 0$. Strict
stationarity means that $(Y(t_1), \ldots, Y(t_n))$ and $(Y(t_1 + h), \ldots, Y(t_n + h))$ have the same
joint distributions for all $t_1, \ldots, t_n$, all $h \ge 0$ and all positive integers $n$.


Figure 11-9. The minimum annual Nile river levels for the years 821-871, with 20 forecasts based on the model (11.4.12).
Continuous-time ARMA (or CARMA) processes are defined as stationary solutions
of stochastic differential equations analogous to the difference equations that are
used to define discrete-time ARMA processes. They play a role in continuous-time
modelling analogous to that of ARMA processes in discrete time.

We shall begin with the Gaussian continuous-time AR(1) process, also known as
the stationary Gaussian Ornstein-Uhlenbeck process.


11.5.1 The Gaussian CAR(1) Process, $\{Y(t), t \ge 0\}$

The Gaussian CAR(1) process $\{Y(t), t \ge 0\}$ is defined as a strictly stationary solution
of the first-order stochastic differential equation,

$$DY(t) + aY(t) = \sigma DB(t) + c, \quad t > 0, \tag{11.5.1}$$

where the operator $D$ denotes differentiation with respect to $t$, $\{B(t), t \in \mathbb{R}\}$ is
standard Brownian motion (see Example 7.5.1), $a$, $c$, and $\sigma$ are parameters and $Y(0)$
is a normally distributed random variable independent of $\{B(t) - B(s), 0 \le s \le t <
\infty\}$. The derivative $DB(t)$ does not exist in the usual sense, so equation (11.5.1) is
interpreted as the Itô differential equation (see Appendix D.4),

$$dY(t) + aY(t)\,dt = \sigma\,dB(t) + c\,dt, \quad t > 0, \tag{11.5.2}$$

with $dY(t)$ and $dB(t)$ denoting the increments of $Y$ and $B$ in the time interval $(t, t + dt)$.

Standard theory of deterministic linear differential equations suggests multiplying
this equation by $e^{at}$ in which case the left-hand side would become $d(e^{at}Y(t))$. We
therefore apply Itô's formula (Appendix, equation (D.3.7)) to $d(e^{at}Y(t))$ with $g(t, x) :=
e^{at}x$ and we obtain exactly the same result since the second partial derivative $g_{xx}$ is zero.
Hence we can rewrite (11.5.2) as

$$d(e^{at}Y(t)) = \sigma e^{at}\,dB(t) + c e^{at}\,dt,$$

or equivalently,

$$e^{at}Y(t) - Y(0) = \sigma \int_0^t e^{au}\,dB(u) + c \int_0^t e^{au}\,du.$$

Thus

$$Y(t) = e^{-at}Y(0) + \sigma \int_0^t e^{-a(t-u)}\,dB(u) + c \int_0^t e^{-a(t-u)}\,du. \tag{11.5.3}$$

Remark 1. The Itô integral $\int_0^t e^{-a(t-u)}\,dB(u)$ in (11.5.3) is of a special type in which
the integrand is deterministic. This permits the application of integration by parts to
obtain a pathwise representation of $Y$ as

$$Y(t) = e^{-at}Y(0) + \sigma B(t) - \sigma \int_0^t a e^{-a(t-u)} B(u)\,du + c \int_0^t e^{-a(t-u)}\,du.$$

□

If $a > 0$ and $Y(0)$ has mean $c/a$ and variance $\sigma^2/(2a)$, it is easy to check,
using the properties of $I_{0,t}(f) = \int_0^t f(u)\,dB(u)$ in Remark 3 of Appendix D.3 and the
independence of $Y(0)$ and $\{B(t) - B(s), 0 \le s \le t < \infty\}$ (Problem 11.4), that $\{Y(t)\}$
as defined by (11.5.3) is stationary with

$$E(Y(t)) = \frac{c}{a} \quad \text{and} \quad \mathrm{Cov}(Y(t + h), Y(t)) = \frac{\sigma^2}{2a}\, e^{-ah}, \quad t, h \ge 0. \tag{11.5.4}$$

Since $\{Y(t)\}$ is Gaussian it is also strictly stationary. Conversely, if $\{Y(t)\}$ is strictly
stationary, then by equating the variances of both sides of (11.5.3), we find that
$(1 - e^{-2at})\,\mathrm{Var}(Y(0)) = \sigma^2 \int_0^t e^{-2au}\,du$ for all $t \ge 0$, and hence that $a > 0$ and
$\mathrm{Var}(Y(0)) = \sigma^2/(2a)$. Equating the means of both sides of (11.5.3) then gives
$E(Y(0)) = c/a$. Necessary and sufficient conditions for $\{Y(t)\}$ to be strictly stationary
are therefore $a > 0$, $E(Y(0)) = c/a$, and $\mathrm{Var}(Y(0)) = \sigma^2/(2a)$.

If $a > 0$ and $0 \le s \le t$, it follows from (11.5.3) that $Y(t)$ satisfies the relation

$$Y(t) = e^{-a(t-s)}Y(s) + \frac{c}{a}\left(1 - e^{-a(t-s)}\right) + \sigma \int_s^t e^{-a(t-u)}\,dB(u), \quad t \ge s \ge 0. \tag{11.5.5}$$

This shows that the process is Markovian, i.e., that the distribution of $Y(t)$ given
$Y(u)$, $u \le s$, is the same as the distribution of $Y(t)$ given $Y(s)$. It also shows that
the conditional mean and variance of $Y(t)$ given $Y(s)$ are

$$E(Y(t) \mid Y(s)) = e^{-a(t-s)}Y(s) + \frac{c}{a}\left(1 - e^{-a(t-s)}\right)$$

and

$$\mathrm{Var}(Y(t) \mid Y(s)) = \frac{\sigma^2}{2a}\left[1 - e^{-2a(t-s)}\right].$$


We can now use the Markov property and the moments of the stationary distribution
to write down the likelihood of observations $y(t_1), \ldots, y(t_n)$ at times $t_1, \ldots, t_n$
of the Gaussian CAR(1) process. This is just the joint density of $(Y(t_1), \ldots, Y(t_n))'$ at
$(y(t_1), \ldots, y(t_n))'$, which can be expressed as the product of the stationary density at
$y(t_1)$ and the transition densities of $Y(t_i)$ given $Y(t_{i-1}) = y(t_{i-1})$, $i = 2, \ldots, n$. The
joint density $g$ is therefore given by

$$g\left(y(t_1), \ldots, y(t_n); a, c, \sigma^2\right) = \prod_{i=1}^{n} \frac{1}{\sqrt{v_i}}\, f\left(\frac{y(t_i) - m_i}{\sqrt{v_i}}\right), \tag{11.5.6}$$

where $f(y) = n(y; 0, 1)$ is the standard normal density, $m_1 = c/a$, $v_1 = \sigma^2/(2a)$, and
for $i > 1$,

$$m_i = e^{-a(t_i - t_{i-1})}\, y(t_{i-1}) + \frac{c}{a}\left(1 - e^{-a(t_i - t_{i-1})}\right)$$

and

$$v_i = \frac{\sigma^2}{2a}\left[1 - e^{-2a(t_i - t_{i-1})}\right].$$

The maximum likelihood estimators of $a$, $c$, and $\sigma^2$ are the values that maximize
$g\left(y(t_1), \ldots, y(t_n); a, c, \sigma^2\right)$. These can be found with the aid of a nonlinear maximization
algorithm. Notice that the times $t_i$ appearing in (11.5.6) are quite arbitrarily spaced.
It is this feature that makes the CAR(1) process so useful for modeling irregularly
spaced data.
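The likelihood (11.5.6) is straightforward to code because each factor depends only on the previous observation and the elapsed time. The sketch below evaluates the log of (11.5.6) for arbitrarily spaced times; the function name and the small data set are assumptions made for illustration, and no optimizer is included.

```python
import math


def car1_loglik(times, y, a, c, sigma2):
    """Log of the joint density (11.5.6) for a Gaussian CAR(1) process.

    `times` must be increasing; a > 0 and sigma2 > 0 are required for
    stationarity.
    """
    if a <= 0 or sigma2 <= 0:
        raise ValueError("a and sigma2 must be positive")
    mean, var = c / a, sigma2 / (2 * a)   # stationary mean and variance
    loglik = 0.0
    for i, (t, obs) in enumerate(zip(times, y)):
        if i == 0:
            m, v = mean, var              # m_1 and v_1
        else:
            decay = math.exp(-a * (t - times[i - 1]))
            m = decay * y[i - 1] + mean * (1 - decay)   # m_i
            v = var * (1 - decay ** 2)                  # v_i
        loglik += -0.5 * math.log(2 * math.pi * v) - (obs - m) ** 2 / (2 * v)
    return loglik


# Hypothetical irregularly spaced observations.
times = [0.0, 0.4, 1.5, 1.7, 3.2]
y = [0.3, 0.1, -0.6, -0.4, 0.8]
print(round(car1_loglik(times, y, a=1.0, c=0.0, sigma2=2.0), 4))
```

Maximum likelihood estimates could then be obtained by passing the negative of this function to any general-purpose numerical optimizer over $(a, c, \sigma^2)$.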

If the observations are regularly spaced, say $t_i = i$, $i = 1, \ldots, n$, then the joint
density $g$ is exactly the same as the joint density of observations of the discrete-time
Gaussian AR(1) process

$$Y_n - \frac{c}{a} = e^{-a}\left(Y_{n-1} - \frac{c}{a}\right) + Z_n, \quad \{Z_t\} \sim \mathrm{IID\ N}\left(0, \frac{\sigma^2\left(1 - e^{-2a}\right)}{2a}\right).$$

This shows that the "embedded" (or sampled) discrete-time process $\{Y(i), i =
1, 2, \ldots\}$ of the CAR(1) process is a discrete-time AR(1) process with coefficient $e^{-a}$.
This coefficient is clearly positive, immediately raising the question of whether there is
a continuous-time ARMA process for which the embedded process is a discrete-time
AR(1) process with negative coefficient. It can be shown (Chan and Tong 1987) that the
answer is yes and that given a discrete-time AR(1) process with negative coefficient,
it can always be embedded in a suitably chosen continuous-time ARMA(2,1) process.


11.5.2 The Gaussian CARMA($p$, $q$) Process, $\{Y(t), t \in \mathbb{R}\}$

We define a zero-mean Gaussian CARMA($p$, $q$) process $\{Y(t), t \in \mathbb{R}\}$ (with $0 \le q <
p$) to be a strictly stationary Gaussian process satisfying the $p$th-order linear differential
equation,

$$D^pY(t) + a_1D^{p-1}Y(t) + \cdots + a_pY(t) = b_0DB(t) + b_1D^2B(t) + \cdots + b_qD^{q+1}B(t), \tag{11.5.7}$$

where $D^j$ denotes $j$-fold differentiation with respect to $t$, $\{B(t), t \in \mathbb{R}\}$ is standard
Brownian motion, and $a_1, \ldots, a_p$ and $b_0, \ldots, b_q$ are constants. We assume that $b_q \ne 0$
and define $b_j := 0$ for $j > q$. We shall also assume that the polynomials, $a(z) :=
z^p + a_1z^{p-1} + \cdots + a_p$ and $b(z) := b_0 + b_1z + \cdots + b_qz^q$, have no common zeroes.
Since the derivatives $D^jB(t)$, $j > 0$, do not exist in the usual sense, we interpret (11.5.7)
as being equivalent (see Remark 2 below) to the observation and state equations

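$$Y(t) = \mathbf{b}'\mathbf{X}(t), \tag{11.5.8}$$

and

$$d\mathbf{X}(t) = A\mathbf{X}(t)\,dt + \mathbf{e}\,dB(t), \tag{11.5.9}$$

where

$$A = \begin{bmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \\ -a_p & -a_{p-1} & -a_{p-2} & \cdots & -a_1 \end{bmatrix}, \quad \mathbf{e} = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \\ 1 \end{bmatrix}, \quad \mathbf{b} = \begin{bmatrix} b_0 \\ b_1 \\ \vdots \\ b_{p-2} \\ b_{p-1} \end{bmatrix},$$

and (11.5.9) is an Itô differential equation for the state vector $\mathbf{X}(t)$ (see Appendix D.4).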


Remark 2. Denoting the components of $\mathbf{X}(t)$ by $X_j(t)$, $j = 0, 1, \ldots, p - 1$, the first $p - 1$
component equations of (11.5.9) are

$$dX_j(t) = X_{j+1}(t)\,dt, \quad j = 0, \ldots, p - 2,$$

showing that $X_j(t)$ is just the $j$th derivative of $X_0(t)$, $j = 1, \ldots, p - 1$. The last
component equation of (11.5.9) is

$$dX_{p-1}(t) = -\left(a_1X_{p-1}(t) + a_2X_{p-2}(t) + \cdots + a_pX_0(t)\right)dt + dB(t),$$

which is the Itô form of the stochastic differential equation,

$$D^pX_0(t) + a_1D^{p-1}X_0(t) + \cdots + a_pX_0(t) = DB(t). \tag{11.5.10}$$

The state equation (11.5.9) is thus the Itô equation for the vector whose first component
$X_0(t)$ satisfies the CARMA($p$, 0) (or CAR($p$)) equation (11.5.10) and whose remaining
components are successively higher derivatives, up to order $p - 1$, of $X_0(t)$. It is clear
from the linearity of equation (11.5.7) that if $\mathbf{X}(t)$ satisfies (11.5.9) then $X_0(t)$ satisfies
(11.5.10) and the linear combination,

$$Y(t) = b_0X_0(t) + b_1X_1(t) + \cdots + b_{p-1}X_{p-1}(t) = \mathbf{b}'\mathbf{X}(t),$$

of $X_0(t)$ and its derivatives satisfies (11.5.7). This explains the replacement of (11.5.7)
by the observation and state equations (11.5.8) and (11.5.9). □


If $\mathbf{X}(0)$ is a normally distributed random vector independent of $\{B(t) - B(s), 0 \le
s \le t < \infty\}$, then equation (11.5.9) is simply a vector form of equation (11.5.2) and
its unique solution for $t \ge 0$, as specified in Appendix D.4, Theorem D.4.1, satisfies

$$\mathbf{X}(t) = e^{At}\mathbf{X}(0) + \int_0^t e^{A(t-u)}\mathbf{e}\,dB(u), \quad 0 \le t < \infty,$$

where the matrix $e^{At}$ is defined in the usual way as $e^{At} := \sum_{j=0}^{\infty} A^j t^j / j!$.

More generally (see Appendix D, equation (D.4.6)), if for each $S \in \mathbb{R}$, $\mathbf{X}(S)$ has
finite second moments and is independent of $\{B(t) - B(s), S \le s \le t < \infty\}$, then the
unique solution of (11.5.9) specified by Theorem D.4.1 satisfies

$$\mathbf{X}(t) = e^{A(t-S)}\mathbf{X}(S) + \int_S^t e^{A(t-u)}\mathbf{e}\,dB(u), \quad t > S, \text{ for all } S \in \mathbb{R}. \tag{11.5.11}$$

If the real parts of the eigenvalues $\lambda_1, \ldots, \lambda_p$ of $A$ (which are also the zeroes of
the autoregressive polynomial $a(z)$) satisfy

$$\operatorname{Re}(\lambda_r) < 0, \quad r = 1, \ldots, p, \tag{11.5.12}$$

and if $\{\mathbf{X}(t)\}$ is a stationary solution of (11.5.11) then, taking mean square limits as
$S \to -\infty$ in (11.5.11), we see that it must satisfy

$$\mathbf{X}(t) = \int_{-\infty}^{t} e^{A(t-u)}\mathbf{e}\,dB(u), \quad t \in \mathbb{R}. \tag{11.5.13}$$

Conversely, if $\{\mathbf{X}(t)\}$ is given by (11.5.13) then it is a stationary solution of (11.5.11)
and for each $S \in \mathbb{R}$, $\mathbf{X}(S)$ has finite second moments and is independent of
$\{B(t) - B(s), S \le s \le t < \infty\}$. Hence it is the unique solution of the type
specified in Theorem D.4.1 with these properties. The property that, for each $S$, $\mathbf{X}(S)$
is independent of $\{B(t) - B(s), S \le s \le t < \infty\}$, corresponds to the discrete-time
concept of causality introduced in Section 3.1.

Assuming that condition (11.5.12) is satisfied, we define the zero-mean causal
Gaussian CARMA($p$, $q$) process $\{Y(t), t \in \mathbb{R}\}$, with parameters $(a_1, \ldots, a_p, b_0,
\ldots, b_q)$, by

$$Y(t) = \mathbf{b}'\mathbf{X}(t), \tag{11.5.14}$$

where $\{\mathbf{X}(t)\}$ is given by (11.5.13). A Gaussian CARMA process with mean $m$ is
obtained by simply adding the constant value $m$ to $Y$.


Remark 3. For the zero-mean causal Gaussian CAR(1) process defined by (11.5.1),
with $c = 0$ and with index set $\mathbb{R}$ instead of $[0, \infty)$ as in Section 11.5.1, we have $b = \sigma$
and $A = -a$, so that

$$Y(t) = \sigma \int_{-\infty}^{t} e^{-a(t-u)}\,dB(u), \quad t \in \mathbb{R}.$$

□


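The autocovariance function of the process $\mathbf{X}(t)$ at lag $h$ is easily found from
(11.5.13) to be

$$\mathrm{Cov}(\mathbf{X}(t + h), \mathbf{X}(t)) = e^{Ah}\Sigma, \quad h \ge 0,$$

where

$$\Sigma := \int_0^{\infty} e^{Ay}\,\mathbf{e}\,\mathbf{e}'\,e^{A'y}\,dy.$$

The mean and autocovariance function of the CARMA($p$, $q$) process $\{Y(t)\}$ are therefore
given by

$$EY(t) = 0$$

and

$$\mathrm{Cov}(Y(t + h), Y(t)) = \mathbf{b}'e^{A|h|}\Sigma\,\mathbf{b}.$$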
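The autocovariance formula can be evaluated numerically from the state-space quantities alone. The following is a minimal pure-Python sketch, not the method used by any particular package: the matrix exponential is computed by a scaled Taylor series with repeated squaring, and $\Sigma = \int_0^\infty e^{Ay}\mathbf{e}\mathbf{e}'e^{A'y}\,dy$ is approximated by the trapezoidal rule on a finite horizon, which assumes that condition (11.5.12) holds and that the horizon is long relative to the slowest decay rate. All function names and numerical settings are assumptions.

```python
def mat_mul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def mat_exp(a, t, squarings=10, terms=20):
    """exp(a*t): Taylor series of a*t/2**squarings, then repeated squaring."""
    p = len(a)
    scaled = [[v * t / 2 ** squarings for v in row] for row in a]
    result = [[float(i == j) for j in range(p)] for i in range(p)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[v / k for v in row] for row in mat_mul(term, scaled)]
        result = [[result[i][j] + term[i][j] for j in range(p)] for i in range(p)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def companion(a_coeffs):
    """Matrix A of (11.5.9) for a(z) = z^p + a_1 z^(p-1) + ... + a_p."""
    p = len(a_coeffs)
    rows = [[float(j == i + 1) for j in range(p)] for i in range(p - 1)]
    rows.append([-a_coeffs[p - 1 - j] for j in range(p)])  # -a_p, ..., -a_1
    return rows


def state_covariance(a_mat, step=0.005, horizon=40.0):
    """Trapezoidal approximation to Sigma on [0, horizon]."""
    p = len(a_mat)
    m = mat_exp(a_mat, step)
    v = [0.0] * (p - 1) + [1.0]            # exp(A*0) e = e
    sigma = [[0.0] * p for _ in range(p)]
    for k in range(int(horizon / step) + 1):
        w = step * (0.5 if k == 0 else 1.0)
        for i in range(p):
            for j in range(p):
                sigma[i][j] += w * v[i] * v[j]
        v = [sum(m[i][j] * v[j] for j in range(p)) for i in range(p)]  # exp(A*step) v
    return sigma


def carma_acvf(a_coeffs, b_coeffs, lag):
    """b' exp(A|lag|) Sigma b, with b padded by zeros to length p."""
    p = len(a_coeffs)
    a_mat = companion(a_coeffs)
    b = list(b_coeffs) + [0.0] * (p - len(b_coeffs))
    sigma = state_covariance(a_mat)
    e_ah = mat_exp(a_mat, abs(lag))
    sb = [sum(sigma[i][j] * b[j] for j in range(p)) for i in range(p)]
    return sum(b[i] * sum(e_ah[i][j] * sb[j] for j in range(p)) for i in range(p))


# CAR(1) check against (11.5.4) with a = 2 and sigma = 3: gamma(h) = (9/4) e^{-2h}.
print(round(carma_acvf([2.0], [3.0], 0.0), 4), round(carma_acvf([2.0], [3.0], 1.0), 4))
# The CARMA(3,2) coefficients quoted in the caption of Figure 11-10.
print(round(carma_acvf([4.0, 4.5, 1.5], [1.0, 0.23, 0.35], 0.0), 4))
```

For the CAR(1) case the result can be checked against (11.5.4); for higher orders a more careful implementation would solve the linear (Lyapunov) equation $A\Sigma + \Sigma A' + \mathbf{e}\mathbf{e}' = 0$ rather than integrate numerically.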

Inference for a CARMA($p$, $q$) process with $p > 1$ is more complicated than for a
CAR(1) process because the former is not Markovian, so the simple argument that led
to (11.5.6) no longer holds. However, the Gaussian likelihood of observations at times
$t_1, \ldots, t_n$ can still easily be computed using the discrete-time Kalman recursions as
pointed out by Jones (1980).

Simulation and estimation, not only for Gaussian, but also for Lévy-driven
CARMA processes (as introduced in the following subsection) can be carried out
using the Yuima package, a package for use in the R environment, which can be
downloaded from https://cran.r-project.org/web/packages/yuima. A detailed account
of its application to CARMA processes is contained in the paper of Iacus and Mercuri
(2015). A simulated Gaussian CARMA(3,2) process and the components of its state vector,
generated in R by the Yuima package, are shown in Figure 11-10.

Rather than examining Gaussian CARMA processes in more detail, we next
introduce the more general class of Lévy-driven CARMA($p$, $q$) processes, whose
marginal distributions can be both heavy-tailed and asymmetric and whose sample
paths are continuous if $q < p - 1$ and have the same jumps as the driving Lévy process
if $q = p - 1$.

11.5.3 Lévy-driven CARMA Processes, $\{Y(t), t \in \mathbb{R}\}$

In Section 11.5.2, under the assumption (11.5.12), we defined the zero-mean causal
Gaussian CARMA process $\{Y(t), t \in \mathbb{R}\}$ as the strictly stationary linear combination
(11.5.14) of components of the state vector $\mathbf{X}(t)$ given by (11.5.13). In this section we
wish to extend the definition by replacing the driving process $B$ by a Lévy process $L$
in order to allow a much broader class of possible marginal distributions for $Y(t)$. As
in Section 11.5.2 we shall assume that the polynomials $a(z)$ and $b(z)$ in the defining
stochastic differential equation,

$$a(D)Y(t) = b(D)DL(t),$$

have no common zeroes. We use the same state-space representation of the process as
in Section 11.5.2 to obtain a rigorous interpretation of this equation.




Figure 11-10. Simulated CARMA(3,2) process $Y$ and state vector $\mathbf{X}$ driven by standard Brownian motion with $a_1 = 4$, $a_2 = 4.5$, $a_3 = 1.5$, $b_0 = 1$, $b_1 = 0.23$, $b_2 = 0.35$.

Replacing the Brownian motion $\{B(t)\}$ in (11.5.11) by the Lévy process $\{L(t)\}$
gives

$$\mathbf{X}(t) = e^{A(t-s)}\mathbf{X}(s) + \int_{(s,t]} e^{A(t-u)}\mathbf{e}\,dL(u), \quad t > s, \ \text{for all } s \in \mathbb{R}, \tag{11.5.15}$$

where the integral is now interpreted in the sense of Protter (2010). We then define the
CARMA($p,q$) process driven by $L$, with coefficients $(a_1,\ldots,a_p,b_1,\ldots,b_q)$, to be
a strictly stationary solution $\{Y(t)\}$ of the equation (11.5.15) and

$$Y(t) = \mathbf{b}'\mathbf{X}(t). \tag{11.5.16}$$

The matrix $A$ and the vector $\mathbf{b}$ are defined as in Section 11.5.2, except that we now
define $b_q := 1$ since there is no variance constraint on $L(1)$ as there was on $B(1)$ in
the definition of the Gaussian special cases in Sections 11.5.1 and 11.5.2. In the case
when $L$ is a subordinator (i.e., a Lévy process with non-decreasing sample paths), the
integral in (11.5.15) can also be interpreted as a pathwise Stieltjes integral with respect
to $L$.

Brockwell and Lindner (2009) have established necessary and sufficient conditions
for the existence of a strictly stationary solution $\{Y(t),\ t \in \mathbb{R}\}$ of (11.5.15)
and (11.5.16). If we assume that $a(z)$ and $b(z)$ have no common zeroes and $L$ is not
deterministic, then the necessary and sufficient conditions are

$$E\max(0, \log|L(1)|) < \infty \tag{11.5.17}$$

and

$$\mathrm{Re}(\lambda_r) \neq 0, \quad r = 1,\ldots,p, \tag{11.5.18}$$

where $\lambda_1,\ldots,\lambda_p$ are the eigenvalues of $A$ (which are also the zeroes of the
autoregressive polynomial $a(z)$).

The strictly stationary solution is unique, and under the stronger conditions,

$$\mathrm{Re}(\lambda_r) < 0, \quad r = 1,\ldots,p, \tag{11.5.19}$$

it is causal, i.e., for every $s$, $t$ and $u$ such that $s < t < u$, $Y(s)$ is independent of
$L(u) - L(t)$ and can be expressed as

$$Y(t) = \int_{(-\infty,t]} \mathbf{b}'e^{A(t-u)}\mathbf{e}\,dL(u). \tag{11.5.20}$$

(The representation (11.5.20) is easily obtained formally by letting $s \to -\infty$ in
(11.5.15) and substituting the resulting expression for $\mathbf{X}(t)$ in (11.5.16).) Thus, under
the causality condition (11.5.19), equations (11.5.15) and (11.5.16) have the unique
strictly stationary solution (11.5.20). This solution is the causal CARMA($p,q$) process
with parameters $(a_1,\ldots,a_p,b_1,\ldots,b_q := 1)$ driven by the Lévy process $L$.
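
The distinction between conditions (11.5.18) and (11.5.19) can be illustrated with a minimal Python sketch for the case $p = 2$. The function names, the tolerance used to decide whether a zero lies on the imaginary axis, and the two example polynomials are illustrative assumptions, not taken from the text; the moment condition (11.5.17) on $L$ is not checked here.

```python
import cmath

def ar_zeros_order2(a1, a2):
    """Zeros of a(z) = z**2 + a1*z + a2 via the quadratic formula."""
    disc = cmath.sqrt(a1 * a1 - 4.0 * a2)
    return ((-a1 + disc) / 2.0, (-a1 - disc) / 2.0)

def has_stationary_solution(zeros, tol=1e-12):
    """Condition (11.5.18): no zero lies on the imaginary axis (tol is an assumption)."""
    return all(abs(lam.real) > tol for lam in zeros)

def is_causal(zeros):
    """Condition (11.5.19): every zero has strictly negative real part."""
    return all(lam.real < 0.0 for lam in zeros)

zeros_a = ar_zeros_order2(2.0, 5.0)    # a(z) = z^2 + 2z + 5, zeros -1 +/- 2i
zeros_b = ar_zeros_order2(-1.0, -2.0)  # a(z) = z^2 - z - 2, zeros 2 and -1
print(is_causal(zeros_a), has_stationary_solution(zeros_b), is_causal(zeros_b))
```

The second polynomial satisfies (11.5.18) but not (11.5.19), so a strictly stationary solution can exist without being causal.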

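From equation (11.5.20) we find that, if $E(L(1)^2) < \infty$, $EY(t) = \mu b_0/a_p$, where
$\mu = EL(1)$, and

$$\gamma_Y(h) := \mathrm{Cov}[Y(t+h), Y(t)] = \sigma^2\,\mathbf{b}'e^{A|h|}\Sigma\,\mathbf{b}, \tag{11.5.21}$$

where $\sigma^2 = \mathrm{Var}(L(1))$ and $\Sigma = \int_0^\infty e^{Ay}\mathbf{e}\,\mathbf{e}'e^{A'y}\,dy$.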

Remark 4. A result of Eller (1987) was used by Brockwell and Lindner (2009) to
rewrite (11.5.20) as

$$Y(t) = \int_{(-\infty,t]} \sum_{\lambda}\sum_{k=0}^{m(\lambda)-1} c_{\lambda k}(t-u)^k e^{\lambda(t-u)}\,dL(u), \tag{11.5.22}$$

where $\sum_{\lambda}$ denotes summation over the zeroes $\lambda$ of $a(z)$, $m(\lambda)$ is the multiplicity of the
zero $\lambda$ and $\sum_{k=0}^{m(\lambda)-1} c_{\lambda k}t^k e^{\lambda t}$ is the residue at $\lambda$ of the mapping $z \mapsto e^{zt}b(z)/a(z)$. If the
zeroes, $\lambda_1,\ldots,\lambda_p$, of $a(z)$ each have multiplicity one, then the expression (11.5.22)
simplifies to

$$Y(t) = \int_{(-\infty,t]} \sum_{r=1}^{p} \alpha_r e^{\lambda_r(t-u)}\,dL(u), \tag{11.5.23}$$

where $\alpha_r = b(\lambda_r)/a'(\lambda_r)$. Hence $\{Y(t)\}$ has a corresponding canonical representation
as a linear combination of (possibly complex-valued) CAR(1) processes,

$$Y(t) = \sum_{r=1}^{p} \alpha_r Y_r(t), \tag{11.5.24}$$

where $Y_r(t) = \int_{(-\infty,t]} e^{\lambda_r(t-u)}\,dL(u)$. Notice that the driving process $L$ is the same for
each of the component processes $\{Y_r(t)\}$, so they are not independent. Corresponding
to the canonical decomposition (11.5.24), if $E(L(1)^2) < \infty$, there is an analogous
representation of the autocovariance function, namely

$$\gamma_Y(h) = \sum_{r=1}^{p} \beta_r e^{\lambda_r|h|}, \tag{11.5.25}$$

where $\beta_r = \sigma^2 b(\lambda_r)b(-\lambda_r)/[a(-\lambda_r)a'(\lambda_r)]$. □
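
The coefficients in (11.5.23) and (11.5.25) are straightforward to evaluate numerically once the zeroes of $a(z)$ are known. The following Python sketch assumes distinct zeroes and stores each polynomial as a list of coefficients in ascending powers of $z$; that storage convention, the function names and the example CARMA(2,1) model are illustrative assumptions rather than anything prescribed in the text.

```python
import cmath

def poly_eval(coeffs, z):
    """Evaluate coeffs[0] + coeffs[1]*z + ... by Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * z + c
    return result

def poly_deriv(coeffs):
    """Coefficients of the derivative polynomial (ascending powers)."""
    return [k * c for k, c in enumerate(coeffs)][1:]

def canonical_coefficients(a, b, zeros, sigma2=1.0):
    """alpha_r from (11.5.23) and beta_r from (11.5.25); distinct zeros assumed."""
    da = poly_deriv(a)
    alphas = [poly_eval(b, lam) / poly_eval(da, lam) for lam in zeros]
    betas = [sigma2 * poly_eval(b, lam) * poly_eval(b, -lam)
             / (poly_eval(a, -lam) * poly_eval(da, lam)) for lam in zeros]
    return alphas, betas

def acvf(betas, zeros, h):
    """gamma(h) = sum_r beta_r * exp(lambda_r * |h|), as in (11.5.25)."""
    total = sum(beta * cmath.exp(lam * abs(h)) for beta, lam in zip(betas, zeros))
    return total.real

# Illustrative model: a(z) = z^2 + 3z + 2 (zeros -1, -2), b(z) = 3 + z, Var(L(1)) = 1.
a = [2.0, 3.0, 1.0]
b = [3.0, 1.0]
zeros = [-1.0, -2.0]
alphas, betas = canonical_coefficients(a, b, zeros)
gamma0 = acvf(betas, zeros, 0.0)
print(alphas, betas, gamma0)
```

For the CAR(1) case $a(z) = z - \lambda$, $b(z) = 1$, the same function returns $\beta_1 = \sigma^2/(2|\lambda|)$, which is the familiar variance of a stationary Ornstein-Uhlenbeck-type process.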

Example 11.5.1. Stochastic Volatility

The stochastic volatility process, $h$, appearing in Example 7.5.4 was defined as

$$h(t) = \int_{(-\infty,t]} e^{\lambda(t-u)}\,dL(u), \quad \text{where } \lambda < 0, \tag{11.5.26}$$

i.e., as a Lévy-driven CARMA(1,0) process with $a(z) = z - \lambda$ and $b(z) = 1$. For
non-negativity of $h$, the Lévy process $L$ is required to be a subordinator, for example,
a gamma process with characteristic function $Ee^{i\theta L(t)} = (1 - i\theta/\beta)^{-\alpha t}$, $EL(t) = \alpha t/\beta$
and $\mathrm{Var}(L(t)) = \alpha t/\beta^2$. Then $Eh(t) = \alpha/(\beta|\lambda|)$ and the autocovariance
function of $h$ is, from (11.5.25), $\gamma_h(s) = e^{\lambda|s|}\alpha/(2\beta^2|\lambda|)$. For any finite-variance
Lévy-driven CARMA(1,0) model for stochastic volatility, the autocorrelation function is
necessarily of the form $\rho(s) = e^{\lambda|s|}$ for some negative $\lambda$. In order to relax this constraint
a non-negative CARMA($p,q$) model for $h$ can be employed (see e.g., Brockwell and
Lindner 2012).

□
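
To see the gamma-driven volatility process in action, the following Python sketch simulates $h$ on an equally spaced grid and compares the sample mean with $\alpha/(\beta|\lambda|)$. It is only a rough approximation under stated assumptions: each gamma increment of $L$ is added at the end of its grid interval (a discretization choice made here, not part of the text), and the parameter values, step size and function names are hypothetical.

```python
import math
import random

def simulate_gamma_ou(lam, alpha, beta, dt, n_steps, seed=0):
    """Approximate h(t) = int exp(lam*(t-u)) dL(u) on a grid of width dt.

    L is a gamma process whose increment over dt has shape alpha*dt and
    rate beta. Each increment is added at the end of its interval.
    """
    rng = random.Random(seed)
    decay = math.exp(lam * dt)
    h = alpha / (beta * abs(lam))      # start at the stationary mean
    path = []
    for _ in range(n_steps):
        # random.gammavariate takes (shape, scale); scale = 1 / rate
        jump = rng.gammavariate(alpha * dt, 1.0 / beta)
        h = decay * h + jump
        path.append(h)
    return path

def theoretical_mean(lam, alpha, beta):
    """Stationary mean alpha / (beta * |lam|)."""
    return alpha / (beta * abs(lam))

def theoretical_acvf(lam, alpha, beta, s):
    """Autocovariance exp(lam*|s|) * alpha / (2 * beta^2 * |lam|)."""
    return math.exp(lam * abs(s)) * alpha / (2.0 * beta ** 2 * abs(lam))

path = simulate_gamma_ou(lam=-1.0, alpha=2.0, beta=1.0, dt=0.02, n_steps=100_000)
sample_mean = sum(path) / len(path)
print(sample_mean, theoretical_mean(-1.0, 2.0, 1.0))
```

Because the increments of a subordinator are non-negative, every simulated value of $h$ stays positive, which is the property that makes this construction suitable for volatility.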

Remark 5. Marginal distributions. The condition (11.5.17) clearly does not
require $L(1)$ (and consequently $Y(t)$) to have finite variance. In fact the condition
$E|L(1)|^r < \infty$ for some $r > 0$ is sufficient to ensure that (11.5.17) holds. Given
a CARMA($p,q$) process $Y$ driven by a Lévy process $L$ with characteristic function
$E(e^{i\theta L(t)}) = \exp(t\xi(\theta))$, $\theta \in \mathbb{R}$, (see Appendix D.1), the joint characteristic function
of $Y(t_1),\ldots,Y(t_n)$ can be expressed in terms of the coefficients of the polynomials $a$
and $b$ and the characteristic exponent $\xi(\cdot)$ of $L$ (see Brockwell (2014)). In particular
the logarithm of the marginal characteristic function of $Y(t)$ is

$$\ln Ee^{i\theta Y(t)} = \int_0^\infty \xi\big(\theta\,\mathbf{b}'e^{Au}\mathbf{e}\big)\,du, \quad \theta \in \mathbb{R}. \tag{11.5.27}$$

For the CAR(1) process $h$ defined by (11.5.26) this simplifies to

$$\ln Ee^{i\theta h(t)} = \int_0^\infty \xi\big(\theta e^{\lambda u}\big)\,du = |\lambda|^{-1}\int_0^\theta y^{-1}\xi(y)\,dy, \tag{11.5.28}$$

(where $\int_0^\theta := -\int_\theta^0$ if $\theta < 0$). If, for example, $L(1)$ has a symmetric stable distribution
with $\ln Ee^{i\theta L(1)} = -c|\theta|^\alpha$, $c > 0$, $0 < \alpha \le 2$, then $E|L(1)|^r < \infty$ for all $r \in (0,\alpha)$
and from (11.5.28) we find at once (Problem 11.8) that

$$\ln Ee^{i\theta h(t)} = -\frac{c}{\alpha|\lambda|}|\theta|^\alpha, \tag{11.5.29}$$

in other words $h(t)$ has a symmetric stable distribution with the same exponent $\alpha$ as
$L(1)$ but with the parameter $c$ replaced by $c/(\alpha|\lambda|)$.
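
A quick numerical sanity check of (11.5.29) is possible by evaluating the first integral in (11.5.28) with a simple quadrature rule. The Python sketch below uses a midpoint rule on a truncated interval; the truncation point, the number of subintervals and the parameter values are arbitrary illustrative choices, and this is a numerical check rather than the analytical verification requested in Problem 11.8.

```python
import math

def log_cf_car1(xi, theta, lam, upper=60.0, n=200_000):
    """Midpoint-rule approximation of int_0^upper xi(theta * exp(lam*u)) du."""
    width = upper / n
    total = 0.0
    for k in range(n):
        u = (k + 0.5) * width
        total += xi(theta * math.exp(lam * u))
    return total * width

c, alpha, lam, theta = 1.5, 1.2, -0.8, 2.0

def xi_stable(y):
    """Characteristic exponent -c*|y|^alpha of a symmetric stable L(1)."""
    return -c * abs(y) ** alpha

numeric = log_cf_car1(xi_stable, theta, lam)
closed_form = -c * abs(theta) ** alpha / (alpha * abs(lam))
print(numeric, closed_form)
```

The two printed values should agree to many decimal places, since the integrand decays exponentially and the truncated tail is negligible.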


Problems 


11.1 Find a transfer function model relating the input and output series $X_{t1}$ and $X_{t2}$,
$t = 1,\ldots,200$, contained in the ITSM data files APPJ.TSM and APPK.TSM,
respectively. Use the fitted model to predict $X_{201,2}$, $X_{202,2}$, and $X_{203,2}$. Compare
the predictors and their mean squared errors with the corresponding predictors
and mean squared errors obtained by modeling $\{X_{t2}\}$ as a univariate ARMA
process and with the results of Problem 8.7.

11.2  Verify  the  calculations  of  Example  11.2.1  to  fit  an  intervention  model  to  the 
series  SB.TSM. 

11.3 If $\{X_t\}$ is the linear process (11.3.1) with $\{Z_t\} \sim \mathrm{IID}(0,\sigma^2)$ and $\eta = EZ_t^3$, show
that the third-order cumulant function of $\{X_t\}$ is given by

$$C_3(r,s) = \eta\sum_{i=-\infty}^{\infty} \psi_i\psi_{i+r}\psi_{i+s}.$$

Use this result to establish equation (11.3.4). Conclude that if $\{X_t\}$ is a Gaussian
linear process, then $C_3(r,s) = 0$ and $f_3(\omega_1,\omega_2) = 0$.


11.4 If $a > 0$ and $Y(0)$ has mean $b/a$ and variance $\sigma^2/(2a)$, show that the process
defined by (11.5.3) is stationary and evaluate its mean and autocovariance
function.


11.5 The file TRINGS.TSM contains normalized tree-ring widths of a Colorado pine
for the years 525-774 (Donald Graybill 1984) from the file C0522.DAT at
http://www-personal.buseco.monash.edu.au/hyndman/TSDL/.

a. Use exact maximum likelihood estimation to fit a fractionally integrated
ARMA model to the first 230 tree-ring widths and use the model to generate
forecasts and 95% prediction bounds for the last 20 observations (corresponding
to $t = 231,\ldots,250$). Plot the entire data set with the forecasts and
prediction bounds superposed on the graph of the data.

b. Repeat part (a), but this time fitting an appropriate ARMA model. Compare
the performance of the two sets of predictors.


11.6 The tent map with parameter $s \in (1,\infty)$ is the function

$$g(x) = s\,x\,I_{[0,1/s)}(x) + \frac{s}{s-1}(1-x)\,I_{[1/s,1]}(x), \quad x \in [0,1],$$

where $I_A$ denotes the indicator function of the set $A$. If $X_0$ has the uniform
distribution on $[0,1]$ (written more concisely as $X_0 \sim U$) and if $\{X_n\}$ is the
sequence defined by $X_n = g(X_{n-1})$, $n = 1,2,\ldots$, then $\{X_n\}$ is a Markov chain
and $X_n \sim U$ for all $n \in \{0,1,2,\ldots\}$, so that $\{X_n\}$ is strictly (and weakly)
stationary.

a. Show that in the symmetric case ($s = 2$), $\{X_n\} \sim \mathrm{WN}(0, 1/12)$.

b. In the general case, $X_n - 0.5 = \phi(X_{n-1} - 0.5) + Z_n$, $n = 1,2,\ldots$, where
$\phi = (2/s) - 1$ and $\{Z_n\}$ is an uncorrelated (but strongly dependent) sequence
of random variables with mean zero and variance $(1-\phi^2)/12$. (See Sakai and
Tokumaru 1980.)


11.7 A Lévy-driven CARMA(2,1) process is defined by the stochastic differential
equation,

$$(D^2 + 1.5D + .5)Y(t) = (D + .2)DL(t), \quad t \in \mathbb{R},$$

where $L$ is a Poisson process with jump-rate $p$.

a. Calculate $EY(t)$.

b. Use (11.5.23) to determine the canonical decomposition of $\{Y(t)\}$.

c. Use (11.5.25) to determine the autocovariance function of $\{Y(t)\}$.

11.8 Use (11.5.27) to verify (11.5.28) and (11.5.29). If $L(1) \sim N(0,\sigma^2)$, use (11.5.29)
to determine the distribution of $h(t)$, as defined by (11.5.26).


Appendix A. Random Variables and Probability Distributions


A.1 Distribution Functions and Expectation

A.2 Random Vectors

A.3 The Multivariate Normal Distribution

A.1  Distribution  Functions  and  Expectation 

The distribution function $F$ of a random variable $X$ is defined by

$$F(x) = P[X \le x] \tag{A.1.1}$$

for all real $x$. The following properties are direct consequences of (A.1.1):

1. $F$ is nondecreasing, i.e., $F(x) \le F(y)$ if $x \le y$.

2. $F$ is right continuous, i.e., $F(y) \downarrow F(x)$ as $y \downarrow x$.

3. $F(x) \to 1$ and $F(y) \to 0$ as $x \to \infty$ and $y \to -\infty$, respectively.

Conversely, any function that satisfies properties 1-3 is the distribution function of
some random variable.

Most of the commonly encountered distribution functions $F$ can be expressed
either as

$$F(x) = \int_{-\infty}^{x} f(y)\,dy \tag{A.1.2}$$

or

$$F(x) = \sum_{j:\,x_j \le x} p(x_j), \tag{A.1.3}$$

where $\{x_0, x_1, x_2, \ldots\}$ is a finite or countably infinite set. In the case (A.1.2) we shall
say that the random variable $X$ is continuous. The function $f$ is called the probability
density function (pdf) of $X$ and can be found from the relation

$$f(x) = F'(x).$$

In case (A.1.3), the possible values of $X$ are restricted to the set $\{x_0, x_1, \ldots\}$, and we
shall say that the random variable $X$ is discrete. The function $p$ is called the probability
mass function (pmf) of $X$, and $F$ is constant except for upward jumps of size $p(x_j)$ at
the points $x_j$. Thus $p(x_j)$ is the size of the jump in $F$ at $x_j$, i.e.,

$$p(x_j) = F(x_j) - F(x_j^-) = P[X = x_j],$$

where $F(x_j^-) = \lim_{y \uparrow x_j} F(y)$.


A.1.1 Examples of Continuous Distributions

(a) The normal distribution with mean $\mu$ and variance $\sigma^2$. We say that a random
variable $X$ has the normal distribution with mean $\mu$ and variance $\sigma^2$ (written more
concisely as $X \sim N(\mu,\sigma^2)$) if $X$ has the pdf given by

$$n(x;\mu,\sigma^2) = (2\pi)^{-1/2}\sigma^{-1}e^{-(x-\mu)^2/(2\sigma^2)}, \quad -\infty < x < \infty.$$

It follows then that $Z = (X-\mu)/\sigma \sim N(0,1)$ and that

$$P[X \le x] = P\left[Z \le \frac{x-\mu}{\sigma}\right] = \Phi\left(\frac{x-\mu}{\sigma}\right),$$

where $\Phi(x) = \int_{-\infty}^{x}(2\pi)^{-1/2}e^{-z^2/2}\,dz$ is known as the standard normal distribution
function. The significance of the terms mean and variance for the parameters
$\mu$ and $\sigma^2$ is explained below (see Example A.1.1).

(b) The uniform distribution on $[a,b]$. The pdf of a random variable uniformly
distributed on the interval $[a,b]$ is given by

$$u(x;a,b) = \begin{cases} \dfrac{1}{b-a}, & \text{if } a \le x \le b, \\ 0, & \text{otherwise.} \end{cases}$$


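(c) The exponential distribution with parameter $\lambda$. The pdf of an exponentially
distributed random variable with parameter $\lambda > 0$ is

$$f(x) = \begin{cases} 0, & \text{if } x < 0, \\ \lambda e^{-\lambda x}, & \text{if } x \ge 0. \end{cases}$$

The corresponding distribution function is

$$F(x) = \begin{cases} 0, & \text{if } x < 0, \\ 1 - e^{-\lambda x}, & \text{if } x \ge 0. \end{cases}$$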


(d) The gamma distribution with parameters $\alpha$ and $\lambda$. The pdf of a gamma-distributed
random variable is

$$f(x) = \begin{cases} 0, & \text{if } x < 0, \\ x^{\alpha-1}\lambda^{\alpha}e^{-\lambda x}/\Gamma(\alpha), & \text{if } x \ge 0, \end{cases}$$

where the parameters $\alpha$ and $\lambda$ are both positive and $\Gamma$ is the gamma function
defined as

$$\Gamma(\alpha) = \int_0^\infty x^{\alpha-1}e^{-x}\,dx.$$

Note that $f$ is the exponential pdf when $\alpha = 1$ and that when $\alpha$ is a positive integer
$\Gamma(\alpha) = (\alpha-1)!$ with $0!$ defined to be 1.

(e) The chi-squared distribution with $\nu$ degrees of freedom. For each positive integer
$\nu$, the chi-squared distribution with $\nu$ degrees of freedom is defined to be the
distribution of the sum

$$X = Z_1^2 + \cdots + Z_\nu^2,$$

where $Z_1,\ldots,Z_\nu$ are independent normally distributed random variables with
mean 0 and variance 1. This distribution is the same as the gamma distribution
with parameters $\alpha = \nu/2$ and $\lambda = \frac12$.
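
The relationships among the exponential, gamma and chi-squared densities in Examples (c), (d) and (e) are easy to confirm numerically. The short Python sketch below evaluates the gamma pdf with `math.gamma`; the function names are illustrative, and the handling of the boundary point $x = 0$ when $\alpha < 1$ (where the density is unbounded) is a convention chosen here.

```python
import math

def gamma_pdf(x, alpha, lam):
    """Gamma density x^(alpha-1) * lam^alpha * exp(-lam*x) / Gamma(alpha) for x >= 0."""
    if x < 0:
        return 0.0
    if x == 0 and alpha < 1:
        return math.inf                # density is unbounded at the origin
    return x ** (alpha - 1) * lam ** alpha * math.exp(-lam * x) / math.gamma(alpha)

def exponential_pdf(x, lam):
    """Exponential density lam * exp(-lam*x) for x >= 0."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def chi2_pdf(x, nu):
    """Chi-squared density written as the gamma density with alpha = nu/2, lam = 1/2."""
    return gamma_pdf(x, nu / 2.0, 0.5)

print(gamma_pdf(1.3, 1.0, 2.0), exponential_pdf(1.3, 2.0))
print(math.gamma(5.0), math.factorial(4))
print(chi2_pdf(3.0, 2), 0.5 * math.exp(-1.5))
```

Each printed pair should agree: the gamma pdf with $\alpha = 1$ is the exponential pdf, $\Gamma(5) = 4!$, and the chi-squared density with 2 degrees of freedom is $\frac12 e^{-x/2}$.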


A.1.2 Examples of Discrete Distributions

(f) The binomial distribution with parameters $n$ and $p$. The pmf of a binomially
distributed random variable $X$ with parameters $n$ and $p$ is

$$b(j;n,p) = P[X = j] = \binom{n}{j}p^j(1-p)^{n-j}, \quad j = 0,1,\ldots,n,$$

where $n$ is a positive integer and $0 \le p \le 1$.

(g) The uniform distribution on $\{1,2,\ldots,k\}$. The pmf of a random variable $X$
uniformly distributed on $\{1,2,\ldots,k\}$ is

$$p(j) = P[X = j] = \frac{1}{k}, \quad j = 1,2,\ldots,k,$$

where $k$ is a positive integer.

(h) The Poisson distribution with parameter $\lambda$. A random variable $X$ is said to have a
Poisson distribution with parameter $\lambda > 0$ if

$$p(j;\lambda) = P[X = j] = \frac{\lambda^j}{j!}e^{-\lambda}, \quad j = 0,1,\ldots.$$

We shall see in Example A.1.2 below that $\lambda$ is the mean of $X$.

(i) The negative binomial distribution with parameters $\alpha$ and $p$. The random variable
$X$ is said to have a negative binomial distribution with parameters $\alpha > 0$ and
$p \in [0,1]$ if it has pmf

$$nb(j;\alpha,p) = \left(\prod_{k=1}^{j}\frac{k-1+\alpha}{k}\right)(1-p)^j p^{\alpha}, \quad j = 0,1,\ldots,$$

where the product is defined to be 1 if $j = 0$.
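
The product form of the negative binomial pmf in Example (i) translates directly into a loop. The Python sketch below evaluates it and checks numerically that the probabilities sum to 1 and that the mean agrees with $\alpha(1-p)/p$, the value stated in Problem A.1; the parameter values and the truncation of the infinite sum at 400 terms are illustrative choices.

```python
def nb_pmf(j, alpha, p):
    """nb(j; alpha, p) = prod_{k=1}^{j} ((k - 1 + alpha) / k) * (1 - p)**j * p**alpha."""
    product = 1.0                      # empty product when j = 0
    for k in range(1, j + 1):
        product *= (k - 1 + alpha) / k
    return product * (1.0 - p) ** j * p ** alpha

alpha, p = 2.5, 0.4
probs = [nb_pmf(j, alpha, p) for j in range(400)]   # tail beyond 400 is negligible here
total = sum(probs)
mean = sum(j * q for j, q in enumerate(probs))
print(total, mean, alpha * (1.0 - p) / p)
```

Note that $\alpha$ need not be an integer, which is why the pmf is written with a product rather than a binomial coefficient.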


Not all random variables can be neatly categorized as either continuous or discrete.
For example, consider the time you spend waiting to be served at a checkout counter
and suppose that the probability of finding no customers ahead of you is $\frac12$. Then the
time you spend waiting for service can be expressed as

$$W = \begin{cases} 0, & \text{with probability } \frac12, \\ W_1, & \text{with probability } \frac12, \end{cases}$$

where $W_1$ is a continuous random variable. If the distribution of $W_1$ is exponential with
parameter 1, then the distribution function of $W$ is

$$F(x) = \begin{cases} 0, & \text{if } x < 0, \\ \frac12 + \frac12(1 - e^{-x}) = 1 - \frac12 e^{-x}, & \text{if } x \ge 0. \end{cases}$$

This distribution function is neither continuous (since it has a discontinuity at $x = 0$)
nor discrete (since it increases continuously for $x > 0$). It is expressible as a mixture,

$$F = pF_d + (1-p)F_c,$$

with $p = \frac12$, of a discrete distribution function

$$F_d(x) = \begin{cases} 0, & x < 0, \\ 1, & x \ge 0, \end{cases}$$

and a continuous distribution function

$$F_c(x) = \begin{cases} 0, & x < 0, \\ 1 - e^{-x}, & x \ge 0. \end{cases}$$

Every distribution function can in fact be expressed in the form

$$F = p_1F_d + p_2F_c + p_3F_{sc},$$

where $0 \le p_1, p_2, p_3 \le 1$, $p_1 + p_2 + p_3 = 1$, $F_d$ is discrete, $F_c$ is continuous, and $F_{sc}$
is singular continuous (continuous but not of the form (A.1.2)). Distribution functions
with a singular continuous component are rarely encountered.


A.1.3 Expectation, Mean, and Variance

The expectation of a function $g$ of a random variable $X$ is defined by

$$E(g(X)) = \int g(x)\,dF(x),$$

where

$$\int g(x)\,dF(x) := \begin{cases} \displaystyle\int_{-\infty}^{\infty} g(x)f(x)\,dx & \text{in the continuous case,} \\[2ex] \displaystyle\sum_{j=0}^{\infty} g(x_j)p(x_j) & \text{in the discrete case,} \end{cases}$$

and $g$ is any function such that $E|g(X)| < \infty$. (If $F$ is the mixture $F = pF_c + (1-p)F_d$,
then $E(g(X)) = p\int g(x)\,dF_c(x) + (1-p)\int g(x)\,dF_d(x)$.) The mean and variance of
$X$ are defined as $\mu = EX$ and $\sigma^2 = E(X-\mu)^2$, respectively. They are evaluated by
setting $g(x) = x$ and $g(x) = (x-\mu)^2$ in the definition of $E(g(X))$.

It is clear from the definition that expectation has the linearity property

$$E(aX + b) = aE(X) + b$$

for any real constants $a$ and $b$ (provided that $E|X| < \infty$).

Example A.1.1 The Normal Distribution

If $X$ has the normal distribution with pdf $n(x;\mu,\sigma^2)$ as defined in Example (a) above,
then

$$\int_{-\infty}^{\infty}(x-\mu)\,n(x;\mu,\sigma^2)\,dx = -\sigma^2\int_{-\infty}^{\infty} n'(x;\mu,\sigma^2)\,dx = 0.$$

This shows, with the help of the linearity property of $E$, that

$$E(X) = \mu,$$

i.e., that the parameter $\mu$ is in fact the mean of the normal distribution defined in
Example (a). Similarly,

$$\int_{-\infty}^{\infty}(x-\mu)^2\,n(x;\mu,\sigma^2)\,dx = -\sigma^2\int_{-\infty}^{\infty}(x-\mu)\,n'(x;\mu,\sigma^2)\,dx.$$

Integrating by parts and using the fact that $n(\cdot\,;\mu,\sigma^2)$ is a pdf, we find that the variance of $X$ is

$$\sigma^2\int_{-\infty}^{\infty} n(x;\mu,\sigma^2)\,dx = \sigma^2.$$

□


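Example A.1.2 The Poisson Distribution

The mean of the Poisson distribution with parameter $\lambda$ (see Example (h) above) is
given by

$$\mu = \sum_{j=0}^{\infty} j\,\frac{\lambda^j}{j!}e^{-\lambda} = \sum_{j=1}^{\infty}\frac{\lambda^j}{(j-1)!}e^{-\lambda} = \lambda e^{-\lambda}\sum_{j=1}^{\infty}\frac{\lambda^{j-1}}{(j-1)!} = \lambda e^{-\lambda}e^{\lambda} = \lambda.$$

A similar calculation shows that the variance is also equal to $\lambda$ (see Problem A.2).

□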


Remark. Functions and parameters associated with a random variable $X$ will be
labeled with the subscript $X$ whenever it is necessary to identify the particular random
variable to which they refer. For example, the distribution function, pdf, mean, and
variance of $X$ will be written as $F_X$, $f_X$, $\mu_X$, and $\sigma_X^2$, respectively, whenever it is
necessary to distinguish them from the corresponding quantities $F_Y$, $f_Y$, $\mu_Y$, and $\sigma_Y^2$
associated with a different random variable $Y$.


A.2  Random  Vectors 

An $n$-dimensional random vector is a column vector $\mathbf{X} = (X_1,\ldots,X_n)'$ each of whose
components is a random variable. The distribution function $F$ of $\mathbf{X}$, also called the
joint distribution of $X_1,\ldots,X_n$, is defined by

$$F(x_1,\ldots,x_n) = P[X_1 \le x_1,\ldots,X_n \le x_n] \tag{A.2.1}$$

for all real numbers $x_1,\ldots,x_n$. This can be expressed in a more compact form as

$$F(\mathbf{x}) = P[\mathbf{X} \le \mathbf{x}], \quad \mathbf{x} = (x_1,\ldots,x_n)',$$

for all real vectors $\mathbf{x} = (x_1,\ldots,x_n)'$. The joint distribution of any subcollection
$X_{i_1},\ldots,X_{i_k}$ of these random variables can be obtained from $F$ by setting $x_j = \infty$
in (A.2.1) for all $j \notin \{i_1,\ldots,i_k\}$. In particular, the distributions of $X_1$ and $(X_1,X_n)'$ are
given by

$$F_{X_1}(x_1) = P[X_1 \le x_1] = F(x_1,\infty,\ldots,\infty)$$

and

$$F_{X_1,X_n}(x_1,x_n) = P[X_1 \le x_1, X_n \le x_n] = F(x_1,\infty,\ldots,\infty,x_n).$$

As in the univariate case, a random vector with distribution function $F$ is said to be
continuous if $F$ has a density function, i.e., if

$$F(x_1,\ldots,x_n) = \int_{-\infty}^{x_n}\cdots\int_{-\infty}^{x_2}\int_{-\infty}^{x_1} f(y_1,\ldots,y_n)\,dy_1\,dy_2\cdots dy_n.$$

The probability density of $\mathbf{X}$ is then found from

$$f(x_1,\ldots,x_n) = \frac{\partial^n F(x_1,\ldots,x_n)}{\partial x_1\cdots\partial x_n}.$$

The random vector $\mathbf{X}$ is said to be discrete if there exist real-valued vectors $\mathbf{x}_0, \mathbf{x}_1, \ldots$
and a probability mass function $p(\mathbf{x}_j) = P[\mathbf{X} = \mathbf{x}_j]$ such that

$$\sum_{j=0}^{\infty} p(\mathbf{x}_j) = 1.$$

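The expectation of a function $g$ of a random vector $\mathbf{X}$ is defined by

$$E(g(\mathbf{X})) = \int g(\mathbf{x})\,dF(\mathbf{x}) = \int g(x_1,\ldots,x_n)\,dF(x_1,\ldots,x_n),$$

where

$$\int g(x_1,\ldots,x_n)\,dF(x_1,\ldots,x_n) = \begin{cases} \displaystyle\int\cdots\int g(x_1,\ldots,x_n)f(x_1,\ldots,x_n)\,dx_1\cdots dx_n, & \text{in the continuous case,} \\[2ex] \displaystyle\sum_{j_1}\cdots\sum_{j_n} g(x_{j_1},\ldots,x_{j_n})\,p(x_{j_1},\ldots,x_{j_n}), & \text{in the discrete case,} \end{cases}$$

and $g$ is any function such that $E|g(\mathbf{X})| < \infty$.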

The random variables $X_1,\ldots,X_n$ are said to be independent if

$$P[X_1 \le x_1,\ldots,X_n \le x_n] = P[X_1 \le x_1]\cdots P[X_n \le x_n],$$

i.e.,

$$F(x_1,\ldots,x_n) = F_{X_1}(x_1)\cdots F_{X_n}(x_n)$$

for all real numbers $x_1,\ldots,x_n$. In the continuous and discrete cases, independence is
equivalent to the factorization of the joint density function or probability mass function
into the product of the respective marginal densities or mass functions, i.e.,

$$f(x_1,\ldots,x_n) = f_{X_1}(x_1)\cdots f_{X_n}(x_n) \tag{A.2.2}$$

or

$$p(x_1,\ldots,x_n) = p_{X_1}(x_1)\cdots p_{X_n}(x_n). \tag{A.2.3}$$

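For two random vectors $\mathbf{X} = (X_1,\ldots,X_n)'$ and $\mathbf{Y} = (Y_1,\ldots,Y_m)'$ with joint
density function $f_{X,Y}$, the conditional density of $\mathbf{Y}$ given $\mathbf{X} = \mathbf{x}$ is

$$f_{Y|X}(\mathbf{y}|\mathbf{x}) = \begin{cases} \dfrac{f_{X,Y}(\mathbf{x},\mathbf{y})}{f_X(\mathbf{x})}, & \text{if } f_X(\mathbf{x}) > 0, \\[2ex] f_Y(\mathbf{y}), & \text{if } f_X(\mathbf{x}) = 0. \end{cases}$$

The conditional expectation of $g(\mathbf{Y})$ given $\mathbf{X} = \mathbf{x}$ is then

$$E(g(\mathbf{Y})|\mathbf{X} = \mathbf{x}) = \int_{-\infty}^{\infty} g(\mathbf{y})f_{Y|X}(\mathbf{y}|\mathbf{x})\,d\mathbf{y}.$$

If $\mathbf{X}$ and $\mathbf{Y}$ are independent, then $f_{Y|X}(\mathbf{y}|\mathbf{x}) = f_Y(\mathbf{y})$ by (A.2.2), and so the conditional
expectation of $g(\mathbf{Y})$ given $\mathbf{X} = \mathbf{x}$ is

$$E(g(\mathbf{Y})|\mathbf{X} = \mathbf{x}) = E(g(\mathbf{Y})),$$

which, as expected, does not depend on $\mathbf{x}$. The same ideas hold in the discrete case
with the probability mass function assuming the role of the density function.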


A.2.1  Means  and  Covariances 

If $E|X_i| < \infty$ for each $i$, then we define the mean or expected value of $\mathbf{X} =
(X_1,\ldots,X_n)'$ to be the column vector

$$\boldsymbol{\mu}_X = E\mathbf{X} = (EX_1,\ldots,EX_n)'.$$

In the same way we define the expected value of any array whose elements are random
variables (e.g., a matrix of random variables) to be the same array with each random
variable replaced by its expected value (if the expectation exists).

If $\mathbf{X} = (X_1,\ldots,X_n)'$ and $\mathbf{Y} = (Y_1,\ldots,Y_m)'$ are random vectors such that each $X_i$
and $Y_j$ has a finite variance, then the covariance matrix of $\mathbf{X}$ and $\mathbf{Y}$ is defined to be
the matrix

$$\Sigma_{XY} = \mathrm{Cov}(\mathbf{X},\mathbf{Y}) = E[(\mathbf{X} - E\mathbf{X})(\mathbf{Y} - E\mathbf{Y})'] = E(\mathbf{X}\mathbf{Y}') - (E\mathbf{X})(E\mathbf{Y})'.$$

The $(i,j)$ element of $\Sigma_{XY}$ is the covariance $\mathrm{Cov}(X_i,Y_j) = E(X_iY_j) - E(X_i)E(Y_j)$. In the
special case where $\mathbf{Y} = \mathbf{X}$, $\mathrm{Cov}(\mathbf{X},\mathbf{Y})$ reduces to the covariance matrix of the random
vector $\mathbf{X}$.

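Now suppose that $\mathbf{Y}$ and $\mathbf{X}$ are linearly related through the equation

$$\mathbf{Y} = \mathbf{a} + B\mathbf{X},$$

where $\mathbf{a}$ is an $m$-dimensional column vector and $B$ is an $m \times n$ matrix. Then $\mathbf{Y}$ has
mean

$$E\mathbf{Y} = \mathbf{a} + BE\mathbf{X} \tag{A.2.4}$$

and covariance matrix

$$\Sigma_{YY} = B\Sigma_{XX}B' \tag{A.2.5}$$

(see Problem A.3).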
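
Formulas (A.2.4) and (A.2.5) amount to two matrix products. A minimal pure-Python sketch follows, with matrices stored as lists of rows; the helper names and the numerical example are illustrative assumptions.

```python
def mat_mul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

def linear_transform_moments(a_vec, b_mat, mean_x, cov_x):
    """Mean (A.2.4) and covariance matrix (A.2.5) of Y = a + B X."""
    b_mean = mat_mul(b_mat, [[m] for m in mean_x])
    mean_y = [a_i + row[0] for a_i, row in zip(a_vec, b_mean)]
    cov_y = mat_mul(mat_mul(b_mat, cov_x), transpose(b_mat))
    return mean_y, cov_y

mean_x = [1.0, 2.0]
cov_x = [[4.0, 1.0], [1.0, 9.0]]
a_vec = [0.0, 1.0, -1.0]
b_mat = [[1.0, 1.0], [1.0, -1.0], [2.0, 0.0]]   # rows: X1 + X2, X1 - X2, 2 * X1
mean_y, cov_y = linear_transform_moments(a_vec, b_mat, mean_x, cov_x)
print(mean_y)
print(cov_y)
```

In this example the first diagonal entry of the result is $\mathrm{Var}(X_1 + X_2) = 4 + 9 + 2(1) = 15$, which can be checked by hand.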

Proposition A.2.1 The covariance matrix $\Sigma_{XX}$ of a random vector $\mathbf{X}$ is symmetric and nonnegative
definite, i.e., $\mathbf{b}'\Sigma_{XX}\mathbf{b} \ge 0$ for all vectors $\mathbf{b} = (b_1,\ldots,b_n)'$ with real components.

Proof Since the $(i,j)$ element of $\Sigma_{XX}$ is $\mathrm{Cov}(X_i,X_j) = \mathrm{Cov}(X_j,X_i)$, it is clear that $\Sigma_{XX}$ is
symmetric. To prove nonnegative definiteness, let $\mathbf{b} = (b_1,\ldots,b_n)'$ be an arbitrary
vector. Then applying (A.2.5) with $\mathbf{a} = \mathbf{0}$ and $B = \mathbf{b}'$, we have

$$\mathbf{b}'\Sigma_{XX}\mathbf{b} = \mathrm{Var}(\mathbf{b}'\mathbf{X}) = \mathrm{Var}(b_1X_1 + \cdots + b_nX_n) \ge 0. \quad ■$$


Proposition A.2.2 Every $n \times n$ covariance matrix $\Sigma$ can be factorized as

$$\Sigma = P\Lambda P'$$

where $P$ is an orthogonal matrix (i.e., $P' = P^{-1}$) whose columns are an orthonormal
set of right eigenvectors corresponding to the (nonnegative) eigenvalues $\lambda_1,\ldots,\lambda_n$ of
$\Sigma$, and $\Lambda$ is the diagonal matrix

$$\Lambda = \begin{bmatrix} \lambda_1 & 0 & \cdots & 0 \\ 0 & \lambda_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \lambda_n \end{bmatrix}.$$

In particular, $\Sigma$ is nonsingular if and only if all the eigenvalues are strictly positive.

Proof Every covariance matrix is symmetric and nonnegative definite by Proposition A.2.1,
and for such matrices the specified factorization is a standard result (see Graybill 1983
for a proof). The determinant of an orthogonal matrix is 1 or $-1$, so that $\det(\Sigma) =
\det(P)\det(\Lambda)\det(P') = \lambda_1\cdots\lambda_n$. It follows that $\Sigma$ is nonsingular if and only if $\lambda_i > 0$
for all $i$. ■


Remark 1. Given a covariance matrix $\Sigma$, it is sometimes useful to be able to find a
square root $A = \Sigma^{1/2}$ with the property that $AA' = \Sigma$. It is clear from Proposition
A.2.2 and the orthogonality of $P$ that one such matrix is given by

$$A = \Sigma^{1/2} = P\Lambda^{1/2}P'.$$

If $\Sigma$ is nonsingular, then we can define

$$\Sigma^{s} = P\Lambda^{s}P', \quad -\infty < s < \infty.$$

The matrix $\Sigma^{-1/2}$ defined in this way is then a square root of $\Sigma^{-1}$ and also the inverse
of $\Sigma^{1/2}$.

□
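
For a $2 \times 2$ covariance matrix the eigenvalues and eigenvectors are available in closed form, so the square root $P\Lambda^{1/2}P'$ can be computed without a linear algebra library. The following Python sketch does this; the function name, the restriction to the $2 \times 2$ case and the example matrix are illustrative assumptions.

```python
import math

def sym_sqrt_2x2(s):
    """Symmetric square root P Lambda^(1/2) P' of a 2x2 covariance matrix."""
    a, b, d = s[0][0], s[0][1], s[1][1]
    if b == 0.0:
        # Already diagonal: the eigenvectors are the coordinate axes.
        return [[math.sqrt(a), 0.0], [0.0, math.sqrt(d)]]
    half_trace = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    root = [[0.0, 0.0], [0.0, 0.0]]
    for lam in (half_trace + radius, half_trace - radius):
        vx, vy = b, lam - a            # eigenvector for eigenvalue lam
        norm = math.hypot(vx, vy)
        vx, vy = vx / norm, vy / norm
        w = math.sqrt(max(lam, 0.0))   # guard against tiny negative rounding error
        root[0][0] += w * vx * vx
        root[0][1] += w * vx * vy
        root[1][0] += w * vy * vx
        root[1][1] += w * vy * vy
    return root

sigma = [[2.0, 1.0], [1.0, 2.0]]       # eigenvalues 3 and 1
root = sym_sqrt_2x2(sigma)
check = [[sum(root[i][k] * root[j][k] for k in range(2)) for j in range(2)]
         for i in range(2)]            # root times its transpose
print(root)
print(check)
```

The matrix `check` should reproduce `sigma` up to rounding error, confirming the defining property $AA' = \Sigma$.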


A.3  The  Multivariate  Normal  Distribution 

The  multivariate  normal  distribution  is  one  of  the  most  commonly  encountered  and 
important  distributions  in  statistics.  It  plays  a  key  role  in  the  modeling  of  time  series 
data. Let $\mathbf{X} = (X_1,\ldots,X_n)'$ be a random vector.


Definition A.3.1

$\mathbf{X}$ has a multivariate normal distribution with mean $\boldsymbol{\mu}$ and nonsingular covariance
matrix $\Sigma = \Sigma_{XX}$, written as $\mathbf{X} \sim N(\boldsymbol{\mu},\Sigma)$, if

$$f_X(\mathbf{x}) = (2\pi)^{-n/2}(\det\Sigma)^{-1/2}\exp\left\{-\frac12(\mathbf{x}-\boldsymbol{\mu})'\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\right\}.$$


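If $\mathbf{X} \sim N(\boldsymbol{\mu},\Sigma)$, we can define a standardized random vector $\mathbf{Z}$ by applying the
linear transformation

$$\mathbf{Z} = \Sigma^{-1/2}(\mathbf{X} - \boldsymbol{\mu}), \tag{A.3.1}$$

where $\Sigma^{-1/2}$ is defined as in the remark of Section A.2. Then by (A.2.4) and (A.2.5),
$\mathbf{Z}$ has mean $\mathbf{0}$ and covariance matrix $\Sigma_{ZZ} = \Sigma^{-1/2}\Sigma\Sigma^{-1/2} = I_n$, where $I_n$ is the $n \times n$
identity matrix. Using the change of variables formula for probability densities (see
Mood et al. 1974), we find that the probability density of $\mathbf{Z}$ is

$$\begin{aligned} f_Z(\mathbf{z}) &= (\det\Sigma)^{1/2}f_X(\Sigma^{1/2}\mathbf{z} + \boldsymbol{\mu}) \\ &= (\det\Sigma)^{1/2}(2\pi)^{-n/2}(\det\Sigma)^{-1/2}\exp\left\{-\frac12(\Sigma^{1/2}\mathbf{z})'\Sigma^{-1}\Sigma^{1/2}\mathbf{z}\right\} \\ &= (2\pi)^{-n/2}\exp\left\{-\frac12\mathbf{z}'\mathbf{z}\right\}, \end{aligned}$$

showing, by (A.2.2), that $Z_1,\ldots,Z_n$ are independent $N(0,1)$ random variables. Thus
the standardized random vector $\mathbf{Z}$ defined by (A.3.1) has independent standard normal
random components. Conversely, given any $n \times 1$ mean vector $\boldsymbol{\mu}$, a nonsingular $n \times n$
covariance matrix $\Sigma$, and an $n \times 1$ vector $\mathbf{Z}$ of independent standard normal random variables, we can
construct a normally distributed random vector with mean $\boldsymbol{\mu}$ and covariance matrix $\Sigma$
by defining

$$\mathbf{X} = \Sigma^{1/2}\mathbf{Z} + \boldsymbol{\mu}. \tag{A.3.2}$$

(See Problem A.4.)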
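
The construction (A.3.2) is the standard way to simulate correlated normal vectors. The Python sketch below does this in two dimensions. One deliberate substitution should be noted: instead of the symmetric square root $\Sigma^{1/2}$, it uses a lower-triangular factor $A$ with $AA' = \Sigma$, which by (A.2.5) gives the same covariance matrix and is simpler to write down in closed form. The parametrization by standard deviations and a correlation, the function names and the sample size are illustrative assumptions.

```python
import math
import random

def sample_bivariate_normal(mu, sigma1, sigma2, rho, n, seed=0):
    """Draw n vectors X = A Z + mu, where A A' = Sigma and Z is standard normal."""
    rng = random.Random(seed)
    a11 = sigma1                       # lower-triangular factor of Sigma
    a21 = rho * sigma2
    a22 = sigma2 * math.sqrt(1.0 - rho * rho)
    draws = []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        draws.append((mu[0] + a11 * z1, mu[1] + a21 * z1 + a22 * z2))
    return draws

def sample_moments(draws):
    """Sample means and the sample covariance entries (c11, c12, c22)."""
    n = len(draws)
    m1 = sum(x for x, _ in draws) / n
    m2 = sum(y for _, y in draws) / n
    c11 = sum((x - m1) ** 2 for x, _ in draws) / (n - 1)
    c12 = sum((x - m1) * (y - m2) for x, y in draws) / (n - 1)
    c22 = sum((y - m2) ** 2 for _, y in draws) / (n - 1)
    return (m1, m2), (c11, c12, c22)

draws = sample_bivariate_normal(mu=(1.0, -2.0), sigma1=2.0, sigma2=1.0, rho=0.6, n=50_000)
means, covs = sample_moments(draws)
print(means, covs)   # target: means (1, -2), covariances (4, 1.2, 1)
```

With 50,000 draws the sample moments should be close to, but not exactly equal to, the target values.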


Remark 1. The multivariate normal distribution with mean $\boldsymbol{\mu}$ and covariance matrix
$\Sigma$ can be defined, even when $\Sigma$ is singular, as the distribution of the vector $\mathbf{X}$ in (A.3.2).
The singular multivariate normal distribution does not have a joint density, since
the possible values of $\mathbf{X} - \boldsymbol{\mu}$ are constrained to lie in a subspace of $\mathbb{R}^n$ with dimension
equal to $\mathrm{rank}(\Sigma)$. □


Remark 2. If $\mathbf{X} \sim N(\boldsymbol{\mu},\Sigma)$, $B$ is an $m \times n$ matrix, and $\mathbf{a}$ is a real $m \times 1$ vector, then
the random vector

$$\mathbf{Y} = \mathbf{a} + B\mathbf{X}$$

is also multivariate normal (see Problem A.5). Note that from (A.2.4) and (A.2.5), $\mathbf{Y}$
has mean $\mathbf{a} + B\boldsymbol{\mu}$ and covariance matrix $B\Sigma B'$. In particular, by taking $B$ to be the
row vector $\mathbf{b}' = (b_1,\ldots,b_n)$, we see that any linear combination of the components
of a multivariate normal random vector is normal. Thus $\mathbf{b}'\mathbf{X} = b_1X_1 + \cdots + b_nX_n \sim
N(\mathbf{b}'\boldsymbol{\mu}_X, \mathbf{b}'\Sigma_{XX}\mathbf{b})$. □


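Example A.3.1. The Bivariate Normal Distribution

Suppose that $\mathbf{X} = (X_1,X_2)'$ is a bivariate normal random vector with mean $\boldsymbol{\mu} =
(\mu_1,\mu_2)'$ and covariance matrix

$$\Sigma = \begin{bmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{bmatrix}, \quad \sigma_1 > 0,\ \sigma_2 > 0,\ -1 < \rho < 1. \tag{A.3.3}$$

The parameters $\sigma_1$, $\sigma_2$, and $\rho$ are the standard deviations and correlation of the
components $X_1$ and $X_2$. Every nonsingular 2-dimensional covariance matrix can be expressed
in the form (A.3.3). The inverse of $\Sigma$ is

$$\Sigma^{-1} = (1-\rho^2)^{-1}\begin{bmatrix} \sigma_1^{-2} & -\rho\sigma_1^{-1}\sigma_2^{-1} \\ -\rho\sigma_1^{-1}\sigma_2^{-1} & \sigma_2^{-2} \end{bmatrix},$$

and so the pdf of $\mathbf{X}$ is given by

$$f_X(\mathbf{x}) = \left[2\pi\sigma_1\sigma_2(1-\rho^2)^{1/2}\right]^{-1}\exp\left\{\frac{-1}{2(1-\rho^2)}\left[\left(\frac{x_1-\mu_1}{\sigma_1}\right)^2 - 2\rho\left(\frac{x_1-\mu_1}{\sigma_1}\right)\left(\frac{x_2-\mu_2}{\sigma_2}\right) + \left(\frac{x_2-\mu_2}{\sigma_2}\right)^2\right]\right\}.$$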

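Multivariate normal random vectors have the important property that the conditional
distribution of any set of components, given any other set, is again multivariate
normal. In the following proposition we shall suppose that the nonsingular normal
random vector $\mathbf{X}$ is partitioned into two subvectors

$$\mathbf{X} = \begin{bmatrix} \mathbf{X}^{(1)} \\ \mathbf{X}^{(2)} \end{bmatrix}.$$

Correspondingly, we shall write the mean and covariance matrix of $\mathbf{X}$ as

$$\boldsymbol{\mu} = \begin{bmatrix} \boldsymbol{\mu}^{(1)} \\ \boldsymbol{\mu}^{(2)} \end{bmatrix} \quad \text{and} \quad \Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix},$$

where $\boldsymbol{\mu}^{(i)} = E\mathbf{X}^{(i)}$ and $\Sigma_{ij} = E(\mathbf{X}^{(i)} - \boldsymbol{\mu}^{(i)})(\mathbf{X}^{(j)} - \boldsymbol{\mu}^{(j)})'$.

Proposition A.3.1.

i. $\mathbf{X}^{(1)}$ and $\mathbf{X}^{(2)}$ are independent if and only if $\Sigma_{12} = 0$.

ii. The conditional distribution of $\mathbf{X}^{(1)}$ given $\mathbf{X}^{(2)} = \mathbf{x}^{(2)}$ is $N(\boldsymbol{\mu}^{(1)} + \Sigma_{12}\Sigma_{22}^{-1}(\mathbf{x}^{(2)} -
\boldsymbol{\mu}^{(2)}),\ \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21})$. In particular,

$$E(\mathbf{X}^{(1)}|\mathbf{X}^{(2)} = \mathbf{x}^{(2)}) = \boldsymbol{\mu}^{(1)} + \Sigma_{12}\Sigma_{22}^{-1}(\mathbf{x}^{(2)} - \boldsymbol{\mu}^{(2)}).$$

The proof of this proposition involves routine algebraic manipulations of the
multivariate normal density function and is left as an exercise (see Problem A.6).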
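
When both subvectors are scalars, part (ii) of the proposition reduces to two one-line formulas. The following Python sketch evaluates them for a covariance matrix parametrized as in (A.3.3); the function name and the numerical values are illustrative assumptions.

```python
def bivariate_conditional(mu1, mu2, sigma1, sigma2, rho, x2):
    """Mean and variance of X1 given X2 = x2, from Proposition A.3.1(ii)."""
    sigma11 = sigma1 ** 2
    sigma12 = rho * sigma1 * sigma2
    sigma22 = sigma2 ** 2
    cond_mean = mu1 + sigma12 / sigma22 * (x2 - mu2)
    cond_var = sigma11 - sigma12 * sigma12 / sigma22
    return cond_mean, cond_var

cond_mean, cond_var = bivariate_conditional(
    mu1=1.0, mu2=2.0, sigma1=2.0, sigma2=3.0, rho=0.5, x2=5.0)
print(cond_mean, cond_var)
```

The conditional variance does not depend on the observed value $x_2$, and it is smaller than $\sigma_1^2$ whenever $\rho \ne 0$.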


Example A.3.2.

For the bivariate normal random vector $\mathbf{X}$ in Example A.3.1, we immediately deduce
from Proposition A.3.1 that $X_1$ and $X_2$ are independent if and only if $\rho\sigma_1\sigma_2 = 0$ (or
$\rho = 0$, since $\sigma_1$ and $\sigma_2$ are both positive). The conditional distribution of $X_1$ given
$X_2 = x_2$ is normal with mean

$$E(X_1|X_2 = x_2) = \mu_1 + \rho\sigma_1\sigma_2^{-1}(x_2 - \mu_2)$$

and variance

$$\mathrm{Var}(X_1|X_2 = x_2) = \sigma_1^2(1-\rho^2).$$

□

Definition A.3.2.

$\{X_t\}$ is a Gaussian time series if all of its joint distributions are multivariate normal,
i.e., if for any collection of integers $i_1,\ldots,i_n$, the random vector $(X_{i_1},\ldots,X_{i_n})'$ has
a multivariate normal distribution.


Remark 3. If $\{X_t\}$ is a Gaussian time series, then all of its joint distributions are
completely determined by the mean function $\mu(t) = EX_t$ and the autocovariance
function $\kappa(s,t) = \mathrm{Cov}(X_s,X_t)$. If the process also happens to be stationary, then
the mean function is constant ($\mu(t) = \mu$ for all $t$) and $\kappa(t+h,t) = \gamma(h)$ for all $t$.
In this case, the joint distribution of $X_1,\ldots,X_n$ is the same as that of $X_{1+h},\ldots,X_{n+h}$
for all integers $h$ and $n > 0$. Hence for a Gaussian time series strict stationarity is
equivalent to weak stationarity (see Section 2.1). □


Problems 


A.1 Let $X$ have a negative binomial distribution with parameters $\alpha$ and $p$, where $\alpha > 0$
and $0 < p < 1$.

a. Show that the probability generating function of $X$ (defined as $M(s) = E(s^X)$)
is

$$M(s) = p^{\alpha}(1 - s + sp)^{-\alpha}, \quad 0 \le s \le 1.$$

b. Using the property that $M'(1) = E(X)$ and $M''(1) = E(X^2) - E(X)$, show that

$$E(X) = \alpha(1 - p)/p \quad\text{and}\quad \mathrm{Var}(X) = \alpha(1 - p)/p^2.$$

A.2 If $X$ has the Poisson distribution with mean $\lambda$, show that the variance of $X$ is also $\lambda$.

A.3 Use the linearity of the expectation operator for real-valued random variables to
establish (A.2.4) and (A.2.5).

A.4 If $\Sigma$ is an $n \times n$ covariance matrix, $\Sigma^{1/2}$ is the square root of $\Sigma$ defined in the
remark of Section A.2, and $\mathbf{Z}$ is an $n$-vector whose components are independent
normal random variables with mean 0 and variance 1, show that

$$\mathbf{X} = \Sigma^{1/2}\mathbf{Z} + \boldsymbol{\mu}$$

is a normally distributed random vector with mean $\boldsymbol{\mu}$ and covariance matrix $\Sigma$.

A.5 Show that if $\mathbf{X}$ is an $n$-dimensional random vector such that $\mathbf{X} \sim N(\boldsymbol{\mu}, \Sigma)$, $B$ is
a real $m \times n$ matrix, and $\mathbf{a}$ is a real-valued $m$-vector, then

$$\mathbf{Y} = \mathbf{a} + B\mathbf{X}$$

is a multivariate normal random vector. Specify the mean and covariance matrix
of $\mathbf{Y}$.

A.6 Prove Proposition A.3.1.

A.7 Suppose that $\mathbf{X} = (X_1, \ldots, X_n)' \sim N(\boldsymbol{\mu}, \Sigma)$ with $\Sigma$ nonsingular. Using the
fact that $\mathbf{Z}$, as defined in (A.3.1), has independent standard normal components,
show that $(\mathbf{X} - \boldsymbol{\mu})'\Sigma^{-1}(\mathbf{X} - \boldsymbol{\mu})$ has the chi-squared distribution with $n$ degrees
of freedom (Section A.1, Example (e)).

A.8 Suppose that $\mathbf{X} = (X_1, \ldots, X_n)' \sim N(\boldsymbol{\mu}, \Sigma)$ with $\Sigma$ nonsingular. If $A$ is a
symmetric $n \times n$ matrix, show that $E(\mathbf{X}'A\mathbf{X}) = \mathrm{trace}(A\Sigma) + \boldsymbol{\mu}'A\boldsymbol{\mu}$.

A.9 Suppose that $\{X_t\}$ is a stationary Gaussian time series with mean 0 and autocovariance
function $\gamma(h)$. Find $E(X_t \mid X_s)$ and $\mathrm{Var}(X_t \mid X_s)$, $s \ne t$.


Statistical  Complements 


B.1  Least  Squares  Estimation 
B.2  Maximum  Likelihood  Estimation 
B.3  Confidence  Intervals 
B.4  Hypothesis  Testing 


B.1  Least  Squares  Estimation 

Consider the problem of finding the “best” straight line

$$y = \theta_0 + \theta_1 x$$

to approximate observations $y_1, \ldots, y_n$ of a dependent variable $y$ taken at fixed values
$x_1, \ldots, x_n$ of the independent variable $x$. The (ordinary) least squares estimates $\hat\theta_0$,
$\hat\theta_1$ are defined to be values of $\theta_0$, $\theta_1$ that minimize the sum

$$S(\theta_0, \theta_1) = \sum_{i=1}^{n} (y_i - \theta_0 - \theta_1 x_i)^2$$

of squared deviations of the observations $y_i$ from the fitted values $\theta_0 + \theta_1 x_i$. (The
“sum of squares” $S(\theta_0, \theta_1)$ is identical to the Euclidean squared distance between $\mathbf{y}$
and $\theta_0\mathbf{1} + \theta_1\mathbf{x}$, i.e.,

$$S(\theta_0, \theta_1) = \|\mathbf{y} - \theta_0\mathbf{1} - \theta_1\mathbf{x}\|^2,$$

where $\mathbf{x} = (x_1, \ldots, x_n)'$, $\mathbf{1} = (1, \ldots, 1)'$, and $\mathbf{y} = (y_1, \ldots, y_n)'$.) Setting the partial
derivatives of $S$ with respect to $\theta_0$ and $\theta_1$ both equal to zero shows that the vector
$\hat{\boldsymbol{\theta}} = (\hat\theta_0, \hat\theta_1)'$ satisfies the “normal equations”

$$X'X\hat{\boldsymbol{\theta}} = X'\mathbf{y},$$
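where $X$ is the $n \times 2$ matrix $X = [\mathbf{1}, \mathbf{x}]$. Since $0 \le S(\boldsymbol{\theta})$ and $S(\boldsymbol{\theta}) \to \infty$ as $\|\boldsymbol{\theta}\| \to \infty$,
the normal equations have at least one solution. If $\hat{\boldsymbol{\theta}}^{(1)}$ and $\hat{\boldsymbol{\theta}}^{(2)}$ are two solutions of the
normal equations, then a simple calculation shows that

$$\big(\hat{\boldsymbol{\theta}}^{(1)} - \hat{\boldsymbol{\theta}}^{(2)}\big)'X'X\big(\hat{\boldsymbol{\theta}}^{(1)} - \hat{\boldsymbol{\theta}}^{(2)}\big) = 0,$$

i.e., that $X\hat{\boldsymbol{\theta}}^{(1)} = X\hat{\boldsymbol{\theta}}^{(2)}$. The solution of the normal equations is unique if and only if
the matrix $X'X$ is nonsingular. But the preceding calculations show that even if $X'X$ is
singular, the vector $\hat{\mathbf{y}} = X\hat{\boldsymbol{\theta}}$ of fitted values is the same for any solution $\hat{\boldsymbol{\theta}}$ of the normal
equations.

The argument just given applies equally well to least squares estimation for the
general linear model. Given a set of data points

$$(x_{i1}, x_{i2}, \ldots, x_{im}, y_i), \quad i = 1, \ldots, n \text{ with } m \le n,$$

the least squares estimate, $\hat{\boldsymbol{\theta}} = (\hat\theta_1, \ldots, \hat\theta_m)'$ of $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_m)'$ minimizes

$$S(\boldsymbol{\theta}) = \sum_{i=1}^{n} (y_i - \theta_1 x_{i1} - \cdots - \theta_m x_{im})^2 = \big\|\mathbf{y} - \theta_1\mathbf{x}^{(1)} - \cdots - \theta_m\mathbf{x}^{(m)}\big\|^2,$$

where $\mathbf{y} = (y_1, \ldots, y_n)'$ and $\mathbf{x}^{(j)} = (x_{1j}, \ldots, x_{nj})'$, $j = 1, \ldots, m$. As in the previous
special case, $\hat{\boldsymbol{\theta}}$ satisfies the equations

$$X'X\hat{\boldsymbol{\theta}} = X'\mathbf{y},$$

where $X$ is the $n \times m$ matrix $X = [\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(m)}]$. The solution of this equation is
unique if and only if $X'X$ is nonsingular, in which case

$$\hat{\boldsymbol{\theta}} = (X'X)^{-1}X'\mathbf{y}.$$

If $X'X$ is singular, there are infinitely many solutions $\hat{\boldsymbol{\theta}}$, but the vector of fitted values
$X\hat{\boldsymbol{\theta}}$ is the same for all of them.


Example B.1.1.


To illustrate the general case, let us fit a quadratic function

$$y = \theta_0 + \theta_1 x + \theta_2 x^2$$

to the data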


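$$\begin{array}{c|ccccc} x & 0 & 1 & 2 & 3 & 4 \\ \hline y & 1 & 0 & 3 & 5 & 8 \end{array}$$

The matrix $X$ for this problem is

$$X = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 1 \\ 1 & 2 & 4 \\ 1 & 3 & 9 \\ 1 & 4 & 16 \end{bmatrix}, \quad\text{giving}\quad (X'X)^{-1} = \frac{1}{140}\begin{bmatrix} 124 & -108 & 20 \\ -108 & 174 & -40 \\ 20 & -40 & 10 \end{bmatrix}.$$

The least squares estimate $\hat{\boldsymbol{\theta}} = (\hat\theta_0, \hat\theta_1, \hat\theta_2)'$ is therefore unique and given by

$$\hat{\boldsymbol{\theta}} = (X'X)^{-1}X'\mathbf{y} = (0.6, -0.1, 0.5)'.$$

The vector of fitted values is given by

$$\hat{\mathbf{y}} = X\hat{\boldsymbol{\theta}} = (0.6, 1, 2.4, 4.8, 8.2)'$$

as compared with the observed values

$$\mathbf{y} = (1, 0, 3, 5, 8)'.$$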


□ 
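
The quadratic fit in this example can be checked by solving the normal equations numerically. The sketch below is not from the original text: it uses exact rational arithmetic from the Python standard library, and the helper names `solve_linear_system` and `least_squares` are hypothetical. It assumes that $X'X$ is nonsingular, as it is for these data.

```python
from fractions import Fraction


def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination; `a` must be square and nonsingular."""
    n = len(a)
    # Augmented matrix [a | b] in exact rational arithmetic.
    m = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Find a row with a nonzero pivot (raises StopIteration if `a` is singular).
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [row[n] for row in m]


def least_squares(design, y):
    """Return the solution of the normal equations X'X theta = X'y."""
    m = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(m)] for i in range(m)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(m)]
    return solve_linear_system(xtx, xty)


xs = [0, 1, 2, 3, 4]
ys = [1, 0, 3, 5, 8]
design = [[1, x, x * x] for x in xs]          # columns 1, x, x^2
theta = least_squares(design, ys)
fitted = [sum(t * v for t, v in zip(theta, row)) for row in design]

print([float(t) for t in theta])    # estimates of theta_0, theta_1, theta_2
print([float(v) for v in fitted])   # fitted values X theta
```

The printed estimates and fitted values agree with $(0.6, -0.1, 0.5)'$ and $(0.6, 1, 2.4, 4.8, 8.2)'$ given above.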


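B.1.1  The Gauss-Markov Theorem


Suppose now that the observations $y_1, \ldots, y_n$ are realized values of random variables
$Y_1, \ldots, Y_n$ satisfying

$$Y_i = \theta_1 x_{i1} + \cdots + \theta_m x_{im} + Z_i,$$

where $\{Z_i\} \sim \mathrm{WN}(0, \sigma^2)$. Letting $\mathbf{Y} = (Y_1, \ldots, Y_n)'$ and $\mathbf{Z} = (Z_1, \ldots, Z_n)'$, we can
write these equations as

$$\mathbf{Y} = X\boldsymbol{\theta} + \mathbf{Z}.$$

Assume for simplicity that the matrix $X'X$ is nonsingular (for the general case see, e.g.,
Silvey 1975). Then the least squares estimator of $\boldsymbol{\theta}$ is, as above,

$$\hat{\boldsymbol{\theta}} = (X'X)^{-1}X'\mathbf{Y},$$

and the least squares estimator of the parameter $\sigma^2$ is the unbiased estimator

$$\hat\sigma^2 = \frac{1}{n - m}\big\|\mathbf{Y} - X\hat{\boldsymbol{\theta}}\big\|^2.$$

It is easy to see that $\hat{\boldsymbol{\theta}}$ is also unbiased, i.e., that

$$E(\hat{\boldsymbol{\theta}}) = \boldsymbol{\theta}.$$

It follows at once that if $\mathbf{c}'\boldsymbol{\theta}$ is any linear combination of the parameters $\theta_i$, $i = 1, \ldots, m$,
then $\mathbf{c}'\hat{\boldsymbol{\theta}}$ is an unbiased estimator of $\mathbf{c}'\boldsymbol{\theta}$. The Gauss-Markov theorem says
that of all unbiased estimators of $\mathbf{c}'\boldsymbol{\theta}$ of the form $\sum_{i=1}^{n} a_i Y_i$, the estimator $\mathbf{c}'\hat{\boldsymbol{\theta}}$ has the
smallest variance.

In the special case where $Z_1, \ldots, Z_n$ are IID $N(0, \sigma^2)$, the least squares estimator $\hat{\boldsymbol{\theta}}$
has the distribution $N(\boldsymbol{\theta}, \sigma^2(X'X)^{-1})$, and $(n - m)\hat\sigma^2/\sigma^2$ has the $\chi^2$ distribution with
$n - m$ degrees of freedom.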


B.1.2  Generalized Least Squares

The Gauss-Markov theorem depends on the assumption that the errors $Z_1, \ldots, Z_n$ are
uncorrelated with constant variance. If, on the other hand, $\mathbf{Z} = (Z_1, \ldots, Z_n)'$ has mean
$\mathbf{0}$ and nonsingular covariance matrix $\sigma^2\Sigma$ where $\Sigma \ne I$, we consider the transformed
observation vector $\mathbf{U} = R^{-1}\mathbf{Y}$, where $R$ is a nonsingular matrix such that $RR' = \Sigma$.
Then

$$\mathbf{U} = R^{-1}X\boldsymbol{\theta} + \mathbf{W} = M\boldsymbol{\theta} + \mathbf{W},$$

where $M = R^{-1}X$ and $\mathbf{W}$ has mean $\mathbf{0}$ and covariance matrix $\sigma^2 I$. The Gauss-Markov
theorem now implies that the best linear estimate of any linear combination $\mathbf{c}'\boldsymbol{\theta}$ is $\mathbf{c}'\hat{\boldsymbol{\theta}}$,
where $\hat{\boldsymbol{\theta}}$ is the generalized least squares estimator, which minimizes

$$\|\mathbf{U} - M\boldsymbol{\theta}\|^2.$$

In the special case where $Z_1, \ldots, Z_n$ are uncorrelated and $Z_i$ has mean 0 and variance
$\sigma^2 r_i^2$, the generalized least squares estimator minimizes the weighted sum of squares

$$\sum_{i=1}^{n} \frac{1}{r_i^2}\,(Y_i - \theta_1 x_{i1} - \cdots - \theta_m x_{im})^2.$$

In the general case, if $X'X$ and $\Sigma$ are both nonsingular, the generalized least squares
estimator is given by

$$\hat{\boldsymbol{\theta}} = (M'M)^{-1}M'\mathbf{U}.$$

Although the least squares estimator $(X'X)^{-1}X'\mathbf{Y}$ is unbiased if $E(\mathbf{Z}) = \mathbf{0}$, even when
the covariance matrix of $\mathbf{Z}$ is not equal to $\sigma^2 I$, the variance of the corresponding
estimator of any linear combination of $\theta_1, \ldots, \theta_m$ is greater than or equal to that of the
estimator based on the generalized least squares estimator.


B.2  Maximum  Likelihood  Estimation 

The method of least squares has an appealing intuitive interpretation. Its application
depends on knowledge only of the means and covariances of the observations. Maximum
likelihood estimation depends on the assumption of a particular distributional
form for the observations, known apart from the values of parameters $\theta_1, \ldots, \theta_m$.
We can regard the estimation problem as that of selecting the most appropriate
value of a parameter vector $\boldsymbol{\theta}$, taking values in a subset $\Theta$ of $\mathbb{R}^m$. We suppose that
these distributions have probability densities $p(\mathbf{x}; \boldsymbol{\theta})$, $\boldsymbol{\theta} \in \Theta$. For a fixed vector of
observations $\mathbf{x}$, the function $L(\boldsymbol{\theta}) = p(\mathbf{x}; \boldsymbol{\theta})$ on $\Theta$ is called the likelihood function. A
maximum likelihood estimate $\hat{\boldsymbol{\theta}}(\mathbf{x})$ of $\boldsymbol{\theta}$ is a value of $\boldsymbol{\theta} \in \Theta$ that maximizes the value
of $L(\boldsymbol{\theta})$ for the given observed value $\mathbf{x}$, i.e.,

$$L(\hat{\boldsymbol{\theta}}) = p(\mathbf{x}; \hat{\boldsymbol{\theta}}(\mathbf{x})) = \max_{\boldsymbol{\theta} \in \Theta} p(\mathbf{x}; \boldsymbol{\theta}).$$


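Example B.2.1. If $\mathbf{x} = (x_1, \ldots, x_n)'$ is a vector of observations of independent $N(\mu, \sigma^2)$ random
variables, the likelihood function is

$$L(\mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2\right\}, \quad -\infty < \mu < \infty,\ \sigma > 0.$$

Maximization of $L$ with respect to $\mu$ and $\sigma$ is equivalent to minimization of

$$-2\ln L(\mu, \sigma^2) = n\ln(2\pi) + 2n\ln(\sigma) + \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2.$$

Setting the partial derivatives of $-2\ln L$ with respect to $\mu$ and $\sigma$ both equal to zero
gives the maximum likelihood estimates

$$\hat\mu = \bar{x}_n = \frac{1}{n}\sum_{i=1}^{n} x_i \quad\text{and}\quad \hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x}_n)^2.$$

□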


B.2.1  Properties of Maximum Likelihood Estimators


The Gauss-Markov theorem lent support to the use of least squares estimation by
showing its property of minimum variance among unbiased linear estimators. Maximum
likelihood estimators are not generally unbiased, but in particular cases they can
be shown to have small mean squared error relative to other competing estimators.
Their main justification, however, lies in their good large-sample behavior.

For independent and identically distributed observations with true probability
density $p(\cdot\,; \boldsymbol{\theta}_0)$ satisfying certain regularity conditions, it can be shown that the
maximum likelihood estimator $\hat{\boldsymbol{\theta}}$ of $\boldsymbol{\theta}_0$ converges in probability to $\boldsymbol{\theta}_0$ and that the
distribution of $\sqrt{n}(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0)$ is approximately normal with mean $\mathbf{0}$ and covariance matrix
$I(\boldsymbol{\theta}_0)^{-1}$, where $I(\boldsymbol{\theta})$ is Fisher’s information matrix with $(i, j)$ component

$$E\left[\frac{\partial \ln p(X; \boldsymbol{\theta})}{\partial \theta_i}\,\frac{\partial \ln p(X; \boldsymbol{\theta})}{\partial \theta_j}\right].$$

In  time  series  analysis  the  situation  is  rather  more  complicated  than  in  the  case 
of  iid  observations.  “Likelihood”  in  the  time  series  context  is  almost  always  used  in 
the  sense  of  Gaussian  likelihood,  i.e.,  the  likelihood  computed  under  the  (possibly 
false)  assumption  that  the  series  is  Gaussian.  Nevertheless,  estimators  of  ARMA 
coefficients  computed  by  maximization  of  the  Gaussian  likelihood  have  good  large- 
sample  properties  analogous  to  those  described  in  the  preceding  paragraph.  For  details 
see  Brockwell  and  Davis  (1991),  Section  10.8. 


B.3  Confidence  Intervals 


Estimation  of  a  parameter  or  parameter  vector  by  least  squares  or  maximum  likelihood 
leads  to  a  particular  value,  often  referred  to  as  a  point  estimate.  It  is  clear  that  this 
will  rarely  be  exactly  equal  to  the  true  value,  and  so  it  is  important  to  convey  some 
idea  of  the  probable  accuracy  of  the  estimator.  This  can  be  done  using  the  notion  of 
confidence  interval,  which  specifies  a  random  set  covering  the  true  parameter  value 
with  some  specified  (high)  probability. 


Example B.3.1. If $\mathbf{X} = (X_1, \ldots, X_n)'$ is a vector of independent $N(\mu, \sigma^2)$ random variables, we saw
in Section B.2 that the random variable $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ is the maximum likelihood
estimator of $\mu$. This is a point estimator of $\mu$. To construct a confidence interval for $\mu$
from $\bar{X}_n$, we observe that the random variable

$$\frac{\bar{X}_n - \mu}{S/\sqrt{n}}$$

has Student’s $t$-distribution with $n - 1$ degrees of freedom, where $S$ is the sample
standard deviation, i.e., $S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2$. Hence,

$$P\left[-t_{1-\alpha/2} < \frac{\bar{X}_n - \mu}{S/\sqrt{n}} < t_{1-\alpha/2}\right] = 1 - \alpha,$$

where $t_{1-\alpha/2}$ denotes the $(1 - \alpha/2)$ quantile of the $t$-distribution with $n - 1$ degrees of
freedom. This probability statement can be expressed in the form

$$P\left[\bar{X}_n - t_{1-\alpha/2}S/\sqrt{n} < \mu < \bar{X}_n + t_{1-\alpha/2}S/\sqrt{n}\right] = 1 - \alpha,$$

which shows that the random interval bounded by $\bar{X}_n \pm t_{1-\alpha/2}S/\sqrt{n}$ includes the true
value $\mu$ with probability $1 - \alpha$. This interval is called a $(1 - \alpha)$ confidence interval
for the mean $\mu$.

□ 


B.3.1  Large-Sample  Confidence  Regions 

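Many estimators of a vector-valued parameter $\boldsymbol{\theta}$ are approximately normally distributed
when the sample size $n$ is large. For example, under mild regularity conditions,
the maximum likelihood estimator $\hat{\boldsymbol{\theta}}(\mathbf{X})$ of $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_m)'$ is approximately
$N\big(\boldsymbol{\theta}, \tfrac{1}{n}I(\boldsymbol{\theta})^{-1}\big)$, where $I(\boldsymbol{\theta})$ is the Fisher information defined in Section B.2. Consequently,

$$n(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})'I(\hat{\boldsymbol{\theta}})(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})$$

is approximately distributed as $\chi^2$ with $m$ degrees of freedom, and the random set of
$\boldsymbol{\theta}$-values defined by

$$n(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})'I(\hat{\boldsymbol{\theta}})(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}}) \le \chi^2_{1-\alpha}(m)$$

covers the true value of $\boldsymbol{\theta}$ with probability approximately equal to $1 - \alpha$.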


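Example B.3.2.


For iid observations $X_1, \ldots, X_n$ from $N(\mu, \sigma^2)$, a straightforward calculation gives,
for $\boldsymbol{\theta} = (\mu, \sigma^2)'$,

$$I(\boldsymbol{\theta}) = \begin{bmatrix} \sigma^{-2} & 0 \\ 0 & \sigma^{-4}/2 \end{bmatrix}.$$

Thus we obtain the large-sample confidence region for $(\mu, \sigma^2)'$,

$$n(\mu - \bar{X}_n)^2/\hat\sigma^2 + n(\sigma^2 - \hat\sigma^2)^2/(2\hat\sigma^4) \le \chi^2_{1-\alpha}(2),$$

which covers the true value of $\boldsymbol{\theta}$ with probability approximately equal to $1 - \alpha$. This
region is an ellipse centered at $(\bar{X}_n, \hat\sigma^2)$.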

□ 
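
Whether a candidate pair $(\mu, \sigma^2)$ lies in this region is easy to check numerically. The sketch below is an illustration only; the function name `in_confidence_region` and the sample values are hypothetical. It relies on the fact that the chi-squared distribution with 2 degrees of freedom is exponential with mean 2, so that $\chi^2_{1-\alpha}(2) = -2\ln\alpha$ and no statistical library is needed.

```python
import math


def in_confidence_region(sample, mu, sigma2, alpha=0.05):
    """Check whether (mu, sigma2) lies in the large-sample region for N(mu, sigma^2) data."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    n = len(sample)
    xbar = sum(sample) / n
    # Maximum likelihood estimate of sigma^2 (divisor n, not n - 1).
    s2_hat = sum((x - xbar) ** 2 for x in sample) / n
    statistic = (n * (mu - xbar) ** 2 / s2_hat
                 + n * (sigma2 - s2_hat) ** 2 / (2 * s2_hat ** 2))
    # The (1 - alpha) quantile of chi-squared with 2 degrees of freedom.
    critical = -2.0 * math.log(alpha)
    return statistic <= critical


# Illustrative (made-up) observations with sample mean 1 and MLE variance 2.
sample = [-1.0, 0.0, 1.0, 2.0, 3.0]
print(in_confidence_region(sample, mu=1.0, sigma2=2.0))    # the centre of the ellipse
print(in_confidence_region(sample, mu=10.0, sigma2=2.0))   # a point far from the centre
```

With only five observations the large-sample approximation is of course crude; the sample is kept small here purely so that the arithmetic can be followed by hand.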


B.4  Hypothesis  Testing 

Parameter estimation can be regarded as choosing one from infinitely many possible
decisions regarding the value of a parameter vector $\boldsymbol{\theta}$. Hypothesis testing, on the other
hand, involves a choice between two alternative hypotheses, a “null” hypothesis $H_0$
and an “alternative” hypothesis $H_1$, regarding the parameter vector $\boldsymbol{\theta}$. The hypotheses
$H_0$ and $H_1$ correspond to subsets $\Theta_0$ and $\Theta_1$ of the parameter set $\Theta$. The problem
is to decide, on the basis of an observed data vector $\mathbf{X}$, whether or not we should
reject the null hypothesis $H_0$. A statistical test of $H_0$ can therefore be regarded as a
partition of the sample space into one set of values of $\mathbf{X}$ for which we reject $H_0$ and
another for which we do not. The problem is to specify a test (i.e., a subset of the
sample space called the “rejection region”) for which the corresponding decision rule
performs well in practice.


Example B.4.1.


If $\mathbf{X} = (X_1, \ldots, X_n)'$ is a vector of independent $N(\mu, 1)$ random variables, we may
wish to test the null hypothesis $H_0$: $\mu = 0$ against the alternative $H_1$: $\mu \ne 0$. A
plausible choice of rejection region in this case is the set of all samples $\mathbf{X}$ for which
$|\bar{X}_n| > c$ for some suitably chosen constant $c$. We shall return to this example after
considering those factors that should be taken into account in the systematic selection
of a “good” rejection region.

□ 


B.4.1  Error  Probabilities 

There are two types of error that may be incurred in the application of a statistical
test:

•  type I error is the rejection of $H_0$ when it is true.

•  type II error is the acceptance of $H_0$ when it is false.

For a given test (i.e., for a given rejection region $R$), the probabilities of error can both
be found from the power function of the test, defined as

$$P_{\boldsymbol{\theta}}(R), \quad \boldsymbol{\theta} \in \Theta,$$

where $P_{\boldsymbol{\theta}}$ is the distribution of $\mathbf{X}$ when the true parameter value is $\boldsymbol{\theta}$. The probabilities
of a type I error are

$$\alpha(\boldsymbol{\theta}) = P_{\boldsymbol{\theta}}(R), \quad \boldsymbol{\theta} \in \Theta_0,$$

and the probabilities of a type II error are

$$\beta(\boldsymbol{\theta}) = 1 - P_{\boldsymbol{\theta}}(R), \quad \boldsymbol{\theta} \in \Theta_1.$$

It is not generally possible to find a test that simultaneously minimizes $\alpha(\boldsymbol{\theta})$ and $\beta(\boldsymbol{\theta})$
for all values of their arguments. Instead, therefore, we seek to limit the probability of
type I error and then, subject to this constraint, to minimize the probability of type II
error uniformly on $\Theta_1$. Given a significance level $\alpha$, an optimum level-$\alpha$ test is a test
satisfying

$$\alpha(\boldsymbol{\theta}) \le \alpha, \quad\text{for all } \boldsymbol{\theta} \in \Theta_0,$$

that minimizes $\beta(\boldsymbol{\theta})$ for every $\boldsymbol{\theta} \in \Theta_1$. Such a test is called a uniformly most powerful
(U.M.P.) test of level $\alpha$. The quantity $\sup_{\boldsymbol{\theta} \in \Theta_0} \alpha(\boldsymbol{\theta})$ is called the size of the test.

In the special case of a simple hypothesis vs. a simple hypothesis, e.g., $H_0$: $\boldsymbol{\theta} = \boldsymbol{\theta}_0$
vs. $H_1$: $\boldsymbol{\theta} = \boldsymbol{\theta}_1$, an optimal test based on the likelihood ratio statistic can be constructed
(see  Silvey  1975).  Unfortunately,  it  is  usually  not  possible  to  find  a  uniformly  most 
powerful  test  of  a  simple  hypothesis  against  a  composite  (more  than  one  value  of  0) 
alternative.  This  problem  can  sometimes  be  solved  by  searching  for  uniformly  most 
powerful  tests  within  the  smaller  classes  of  unbiased  or  invariant  tests.  For  further 
information  see  Lehmann  (1986). 
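
As a concrete illustration of a power function, consider the test of Example B.4.1 with rejection region $|\bar{X}_n| > c$. The sketch below is an illustration under stated assumptions rather than part of the original text: the function name `power` is hypothetical, and $c$ is taken to be the standard normal $(1 - \alpha/2)$ quantile divided by $\sqrt{n}$, which gives the test size $\alpha$ because $\bar{X}_n \sim N(\mu, 1/n)$.

```python
from statistics import NormalDist


def power(mu, n, alpha=0.05):
    """Power P_mu(|Xbar_n| > c) of the two-sided test of H0: mu = 0.

    Assumes X_1, ..., X_n are independent N(mu, 1), so Xbar_n ~ N(mu, 1/n),
    and c = z_{1 - alpha/2} / sqrt(n).
    """
    c = NormalDist().inv_cdf(1 - alpha / 2) / n ** 0.5
    xbar = NormalDist(mu, 1 / n ** 0.5)
    # Probability that Xbar_n falls below -c or above c.
    return xbar.cdf(-c) + (1 - xbar.cdf(c))


n = 25
print(power(0.0, n))          # type I error probability alpha(0), i.e. the size
for mu in (0.2, 0.5, 1.0):
    print(mu, power(mu, n), 1 - power(mu, n))   # power and type II error beta(mu)
```

The power equals $\alpha$ at $\mu = 0$ and increases with $|\mu|$, so the type II error probability $\beta(\mu) = 1 - P_\mu(R)$ decreases as the true mean moves away from the null value.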


B.4.2  Large-Sample  Tests  Based  on  Confidence  Regions 

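There is a natural link between the testing of a simple hypothesis $H_0$: $\boldsymbol{\theta} = \boldsymbol{\theta}_0$
vs. $H_1$: $\boldsymbol{\theta} \ne \boldsymbol{\theta}_0$ and the construction of confidence regions. To illustrate this
connection, suppose that $\hat{\boldsymbol{\theta}}$ is an estimator of $\boldsymbol{\theta}$ whose distribution is approximately
$N(\boldsymbol{\theta}, n^{-1}I^{-1}(\boldsymbol{\theta}))$, where $I(\boldsymbol{\theta})$ is a positive definite matrix. This is usually the case, for
example, when $\hat{\boldsymbol{\theta}}$ is a maximum likelihood estimator and $I(\boldsymbol{\theta})$ is the Fisher information.
As in Section B.3.1, we have

$$P_{\boldsymbol{\theta}}\Big(n(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})'I(\hat{\boldsymbol{\theta}})(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}) \le \chi^2_{1-\alpha}(m)\Big) \approx 1 - \alpha.$$

Consequently, an approximate $\alpha$-level test is to reject $H_0$ if

$$n(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0)'I(\hat{\boldsymbol{\theta}})(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_0) > \chi^2_{1-\alpha}(m),$$

or equivalently, if the confidence region determined by those $\boldsymbol{\theta}$’s satisfying

$$n(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta})'I(\hat{\boldsymbol{\theta}})(\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}) \le \chi^2_{1-\alpha}(m)$$

does not include $\boldsymbol{\theta}_0$.


Example B.4.2.


Consider again the problem described in Example B.4.1. Since $\bar{X}_n \sim N(\mu, n^{-1})$, the
hypothesis $H_0$: $\mu = 0$ is rejected at level $\alpha$ if

$$n(\bar{X}_n)^2 > \chi^2_{1-\alpha}(1),$$

or equivalently, if

$$|\bar{X}_n| > \frac{\Phi_{1-\alpha/2}}{n^{1/2}}.$$

□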


Mean  Square  Convergence 


C.1  The  Cauchy  Criterion 


The sequence $S_n$ of random variables is said to converge in mean square to the random
variable $S$ if

$$E(S_n - S)^2 \to 0 \quad\text{as } n \to \infty.$$

In particular, we say that the sum $\sum_{k=1}^{\infty} X_k$ converges (in mean square) if there exists
a random variable $S$ such that $E\big(\sum_{k=1}^{n} X_k - S\big)^2 \to 0$ as $n \to \infty$. If this is the case,
then we use the notation $S = \sum_{k=1}^{\infty} X_k$.


C.1  The  Cauchy  Criterion 

For  a  given  sequence  Sn  of  random  variables  to  converge  in  mean  square  to  some 
random  variable,  it  is  necessary  and  sufficient  that 

$$E(S_m - S_n)^2 \to 0 \quad\text{as } m, n \to \infty$$

(for  a  proof  of  this  see  Brockwell  and  Davis  (1991),  Chapter  2).  The  point  of  the 
criterion  is  that  it  permits  checking  for  mean  square  convergence  without  having  to 
identify  the  limit  of  the  sequence. 

Example C.1.1. Consider the sequence of partial sums $S_n = \sum_{t=-n}^{n} a_t Z_t$, $n = 1, 2, \ldots$, where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$.
Under what conditions on the coefficients $a_t$ does this sequence converge
in mean square? To answer this question we apply the Cauchy criterion as follows. For
$n > m > 0$,

$$E(S_n - S_m)^2 = E\Big(\sum_{m < |i| \le n} a_i Z_i\Big)^2 = \sigma^2 \sum_{m < |i| \le n} a_i^2.$$

Consequently, $E(S_n - S_m)^2 \to 0$ if and only if $\sum_{m < |i| \le n} a_i^2 \to 0$. Since the Cauchy
criterion applies also to real-valued sequences, this last condition is equivalent to
convergence of the sequence $\sum_{i=-n}^{n} a_i^2$, or equivalently to the condition

$$\sum_{i=-\infty}^{\infty} a_i^2 < \infty. \tag{C.1.1}$$

□ 
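
The criterion in this example can be explored numerically by evaluating $\sigma^2 \sum_{m<|i|\le n} a_i^2$ for increasing $m$. The following sketch is an illustration only; the function name `cauchy_gap` and the two coefficient sequences are hypothetical choices, one square-summable and one not.

```python
def cauchy_gap(coef, m, n, sigma2=1.0):
    """Return E(S_n - S_m)^2 = sigma^2 * sum of a_i^2 over m < |i| <= n, for n > m >= 0."""
    return sigma2 * sum(coef(i) ** 2 + coef(-i) ** 2 for i in range(m + 1, n + 1))


def square_summable(i):
    """a_i = 1 / (1 + |i|); the squares a_i^2 have a finite sum."""
    return 1.0 / (1 + abs(i))


def not_square_summable(i):
    """a_i = 1 / sqrt(1 + |i|); the squares a_i^2 have an infinite sum."""
    return 1.0 / (1 + abs(i)) ** 0.5


for m in (10, 100, 1000):
    print(m, cauchy_gap(square_summable, m, 2 * m),
          cauchy_gap(not_square_summable, m, 2 * m))
```

For the first sequence the gap between $S_m$ and $S_{2m}$ shrinks roughly like $1/m$, in line with (C.1.1); for the second it stays near $2\ln 2$, so the partial sums are not Cauchy in mean square.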


Properties  of  Mean  Square  Convergence: 

If $X_n \to X$ and $Y_n \to Y$, in mean square as $n \to \infty$, then

(a) $E(X_n^2) \to E(X^2)$,

(b) $E(X_n) \to E(X)$,
and

(c) $E(X_n Y_n) \to E(XY)$.


Proof.  See  Brockwell  and  Davis  (1991),  Proposition  2.1.2. 


Levy  Processes,  Brownian 
Motion  and  Ito  Calculus 


D.1  Levy  Processes 

D.2  Brownian  Motion  and  the  Ito  Integral 
D.3  Ito  Processes  and  Ito's  Formula 

D.4  Ito  Stochastic  Differential  Equations 


D.1  Levy  Processes 

Just  as  ARMA  processes  were  defined  as  stationary  solutions  of  stochastic  difference 
equations  driven  by  white  noise,  the  so-called  CARMA  (continuous-time  ARMA) 
models  arise  as  stationary  solutions  of  stochastic  differential  equations  driven  by  Levy 
processes.  In  order  to  discuss  these  equations  in  more  detail  we  first  present  a  few 
essential  facts  concerning  Levy  processes.  (For  detailed  accounts  see  Protter  2010; 
Applebaum  2004;  Bertoin  1996;  Sato  1999.)  They  have  already  been  introduced  in 
Definition  7.5.1,  but  for  ease  of  reference  we  repeat  the  definition  here. 


Definition  D.1.1. 


A Levy process, $\{L(t), t \in \mathbb{R}\}$ is a process with the following properties:

(i) $L(0) = 0$.

(ii) $L(t) - L(s)$ has the same distribution as $L(t - s)$ for all $s$ and $t$ such that $s \le t$.

(iii) If $(s, t)$ and $(u, v)$ are disjoint intervals then $L(t) - L(s)$ and $L(v) - L(u)$ are
independent.

(iv) $\{L(t)\}$ is continuous in probability, i.e. for all $\epsilon > 0$ and for all $t \in \mathbb{R}$,

$$\lim_{s \to t} P(|L(t) - L(s)| > \epsilon) = 0.$$


It  is  known  that  every  Levy  process  has  a  version  with  sample-paths  which  are  right 
continuous  with  left  limits  (cadlag  for  short).  We  shall  therefore  assume  that  our  Levy 
processes  have  this  property. 

The characteristic function of $L(t)$, $\phi_t(\theta) := E(\exp(i\theta L(t)))$, has the celebrated
Levy-Khinchin representation, for $t \ge 0$,

$$\phi_t(\theta) = \exp(t\xi(\theta)), \quad \theta \in \mathbb{R},$$

where

$$\xi(\theta) = i\theta\mu - \frac{1}{2}\theta^2\sigma^2 + \int_{\mathbb{R}} \big(e^{i\theta x} - 1 - i\theta x I_{(-1,1)}(x)\big)\,\nu(dx),$$

for some $\mu \in \mathbb{R}$, $\sigma \ge 0$, and measure $\nu$. $I_{(-1,1)}$ is the indicator function of the set
$(-1, 1)$. The measure $\nu$ is known as the Levy measure of the process $L$ and satisfies
the conditions

$$\nu(\{0\}) = 0$$

and

$$\int_{\mathbb{R}} \min(1, |u|^2)\,\nu(du) < \infty.$$

The triplet $(\sigma^2, \nu, \mu)$ is often referred to as the characteristic triplet of the Levy process
and completely determines all of its finite-dimensional distributions.

The measure $\nu$ characterizes the distribution of the jumps of the process. If, in
particular, $\nu$ is the zero measure then the characteristic function of $L(t)$ for $t > 0$, is
that of a normal random variable with $E(L(t)) = \mu t$ and $\mathrm{Var}(L(t)) = \sigma^2 t$ and the
process $\{L(t), t \in \mathbb{R}\}$ is Brownian motion (Example 7.5.1) with sample-paths which
are continuous (but nowhere differentiable).

If $\lambda := \nu(\mathbb{R}) < \infty$ then the expected number of jumps in any time-interval of
length $t$ is $\lambda t$ and the expected number of jumps with size in $(-\infty, x]$ in the same
time interval is $t\nu((-\infty, x]) = \lambda t F(x)$ where $F$ is a probability distribution function.
The distribution function $F$ is known as the jump-size distribution and $\lambda$ is known as
the mean jump-rate. If $\sigma^2 = 0$ and $\mu = \lambda \int_{(-1,1)} x\,dF(x)$, then $\{L(t)\}$ is a compound
Poisson process with parameters $\lambda$ and $F$ (Example 7.5.2) and with sample paths which
are  constant  except  for  jumps. 
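
A compound Poisson process with finite mean jump-rate is straightforward to simulate. The sketch below is an illustration under stated assumptions, not the construction used in Example 7.5.2: it generates jump times from independent exponential waiting times with the given rate and draws jump sizes from a user-supplied sampler. The names `compound_poisson` and `normal_jump`, the seed, and all numerical values are hypothetical.

```python
import random


def compound_poisson(t, rate, jump_sampler, rng):
    """Simulate L(t): the sum of all jumps occurring in the time interval (0, t]."""
    total = 0.0
    clock = rng.expovariate(rate)          # time of the first jump
    while clock <= t:
        total += jump_sampler(rng)         # add a jump drawn from F
        clock += rng.expovariate(rate)     # waiting time until the next jump
    return total


def normal_jump(rng):
    """Jump-size distribution F, here N(1, 0.5^2) purely for illustration."""
    return rng.gauss(1.0, 0.5)


rng = random.Random(12345)
t, rate = 2.0, 3.0
draws = [compound_poisson(t, rate, normal_jump, rng) for _ in range(20000)]

sample_mean = sum(draws) / len(draws)
sample_var = sum((d - sample_mean) ** 2 for d in draws) / (len(draws) - 1)

# For a compound Poisson process, E L(t) = rate * t * E(J) and
# Var L(t) = rate * t * E(J^2), where J denotes a single jump.
print(sample_mean, rate * t * 1.0)
print(sample_var, rate * t * (1.0 ** 2 + 0.5 ** 2))
```

Both moments grow linearly in $t$, and the simulated sample paths change only at the jump times.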

If $\lambda = \infty$ then the expected number of jumps in every interval of positive length
is infinite and the process $\{L(t)\}$ is said to have infinite activity. The gamma process
of Example 11.5.1 is such a process with characteristic triplet $(0, \nu, \alpha(1 - e^{-\beta})/\beta)$,
where $\nu$ is the measure defined on subsets of $(0, \infty)$ by

$$\nu(dx) = \alpha x^{-1} e^{-\beta x} I_{(0,\infty)}(x)\,dx.$$

The  Levy-Khinchin  representation  of  the  characteristic  function  of  L(t)  shows  that 
the  distribution  of  L(t)  can,  by  appropriate  choice  of  the  characteristic  triplet,  be  any 
infinitely  divisible  distribution.  This  family  includes  a  vast  array  of  distributions  such 
as  the  normal  distributions,  compound  Poisson  distributions,  Student’s  t-distributions, 
the  stable  distributions  and  many  others.  In  particular  it  includes  distributions  which 
have  heavy  tails  and  which  are  not  necessarily  symmetric.  These  features  allow  for 
great  flexibility  when  modelling  observed  phenomena  in  both  financial  and  physical 
contexts. 

In this appendix we shall restrict attention to Levy processes for which
$EL(1)^2 < \infty$. This constraint is not serious for most applications in finance where
it is generally believed that second moments exist while higher moments (those of
order four or more) may not. For Levy processes with $EL(1)^2 < \infty$ it follows from
the definition that there are finite constants $m$ and $s \ge 0$ such that

$$EL(t) = mt \quad\text{and}\quad \mathrm{Var}(L(t)) = s^2 t \quad\text{for all } t \ge 0.$$

In  the  following  sections  we  shall  focus  on  Brownian  motion  and  stochastic 
differential  equations  driven  by  Brownian  motion. 

In  order  to  develop  the  necessary  tools  we  introduce  the  Ito  stochastic  integral,  Ito 
processes  and  Ito’s  formula.  Following  this  we  shall  outline  some  results  concerning 
the  solution  of  stochastic  differential  equations  and  use  them  to  expand  on  the 
treatment  of  Gaussian  CARMA  processes  and  their  Levy-driven  generalizations  in 
Section  11.5. 
D.2  Brownian Motion and the Ito Integral

Robert Brown (1828) observed the erratic motion of pollen particles in a liquid which
was later explained by the irregular bombardment of the particles by the molecules of
the liquid. In order to provide a mathematical model for the one-dimensional version of
this process, Einstein (1905) postulated the existence of a process satisfying conditions
(i)-(iii) of Definition D.1.1 with $L(t)$ normally distributed for every $t$. Bachelier (1900)
had in fact already proposed such a model for the prices of stocks on the Paris stock
exchange. It was later shown by Wiener that there is a process with continuous sample-
paths satisfying these conditions, a process which has come to be known as a Brownian
motion or Wiener process. It is in fact the only Levy process with continuous sample-
paths, a feature which adds to its plausibility as a model for the physical process
originally observed by Brown. Although the sample-paths are continuous they are far
from smooth in the sense that they are nowhere differentiable. We shall not attempt to
prove these properties here but refer to the books of Mikosch (1998), Klebaner (2005)
and Øksendal (2013) for further details. In the following sections we shall give an
outline of the essentials of Ito calculus adapted from the more extensive treatment of
Øksendal.

For modelling more complex physical phenomena it is often appropriate to
suppose that the increment $dX(t)$ of the observed process $\{X(t)\}$ in the infinitesimally
small time interval $(t, t + dt)$ satisfies an equation of the form

$$dX(t) = b(t, X(t))\,dt + \sigma(t, X(t))\,dB(t), \quad S \le t \le T, \tag{D.2.1}$$

where $dB(t)$ denotes the increment of a standard Brownian motion in the same time
interval. In order to attach a precise meaning to (D.2.1) we first consider the following
discrete approximation. For any fixed positive integer $n$, consider the grid of time
points $\{2^{-n}k, k \in \mathbb{Z}\}$ and define

$$t_k = \begin{cases} 2^{-n}k, & \text{if } S \le 2^{-n}k \le T, \\ S, & \text{if } 2^{-n}k < S, \\ T, & \text{if } 2^{-n}k > T. \end{cases} \tag{D.2.2}$$

A discrete approximation to (D.2.1) is then

$$X_{j+1}^n - X_j^n = b(t_j, X_j^n)\,\Delta t_j + \sigma(t_j, X_j^n)\,\Delta B_j, \tag{D.2.3}$$

where $X_j^n := X(t_j)$, $\Delta t_j := t_{j+1} - t_j$, and $\Delta B_j := B(t_{j+1}) - B(t_j)$. For given functions
$b$ and $\sigma$ and for any given initial condition, $X_{[2^n S]}^n = X(S)$, and values of $B(t_j)$, $j \le k$,
equation (D.2.3) can be solved recursively for $X_j^n$, $j \le k$. The solution satisfies

$$X_{j+1}^n = X(S) + \sum_{k \le j} b(t_k, X_k^n)\,\Delta t_k + \sum_{k \le j} \sigma(t_k, X_k^n)\,\Delta B_k, \quad [2^n S] \le j \le [2^n T]. \tag{D.2.4}$$
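
The recursion (D.2.3) translates directly into a short simulation routine. The following is a minimal sketch under stated assumptions, not a definitive implementation: the function name `euler_path` and its arguments are hypothetical, the Brownian increments $\Delta B_j$ are drawn as independent $N(0, \Delta t_j)$ variables, and the grid points are clipped to $[S, T]$ as in (D.2.2).

```python
import math
import random


def euler_path(b, sigma, x_start, s, t_end, n, rng):
    """Solve the discrete approximation (D.2.3) recursively on the grid 2^-n * k."""
    step = 2.0 ** -n
    k_first = math.floor(s / step)        # [2^n S]
    k_last = math.floor(t_end / step)     # [2^n T]
    times, values = [s], [x_start]
    for k in range(k_first, k_last + 1):
        # Grid times clipped to the interval [S, T], as in (D.2.2).
        t_j = min(max(k * step, s), t_end)
        t_next = min(max((k + 1) * step, s), t_end)
        dt = t_next - t_j
        if dt <= 0.0:
            continue                      # no time elapses at a clipped end point
        d_b = rng.gauss(0.0, math.sqrt(dt))   # Brownian increment over (t_j, t_next]
        x = values[-1]
        values.append(x + b(t_j, x) * dt + sigma(t_j, x) * d_b)
        times.append(t_next)
    return times, values


# Check with sigma = 0: dX = -X dt has the solution X(t) = exp(-t).
times, values = euler_path(lambda t, x: -x, lambda t, x: 0.0,
                           x_start=1.0, s=0.0, t_end=1.0, n=10,
                           rng=random.Random(0))
print(times[-1], values[-1], math.exp(-1.0))

# An illustrative stochastic example: dX = 0.1 X dt + 0.2 X dB, X(0) = 1.
_, noisy = euler_path(lambda t, x: 0.1 * x, lambda t, x: 0.2 * x,
                      x_start=1.0, s=0.0, t_end=1.0, n=10, rng=random.Random(0))
print(noisy[-1])
```

With $\sigma \equiv 0$ the scheme reduces to the ordinary Euler method for a differential equation, which makes the deterministic check above possible; increasing `n` refines the grid.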

This suggests that, under suitable conditions, as $n \to \infty$, the random variables
$X_j^n$, $[2^n S] \le j \le [2^n T] + 1$, approximate (in a sense to be specified) a random process
$\{X(t), S \le t \le T\}$ satisfying

$$X(t) = X(S) + \int_S^t b(u, X(u))\,du + \int_S^t \sigma(u, X(u))\,dB(u), \quad S \le t \le T. \tag{D.2.5}$$

In order to make sense of these statements, and to solve equations of the form
(D.2.5) we must first define what is meant by the integrals on the right-hand side. We
shall do this for non-anticipating integrands. The random process $\{X(t)\}$ is said to be
a non-anticipating function of $\{B(t)\}$ if, for each $t$, $X(t)$ is a function of $\{B(s), s \le t\}$.
This property is the continuous-time analogue of causality, which we introduced in
connection with ARMA processes in Chapter 3. We shall use the notation $\mathcal{F}_t$ to denote
the class of random variables on $(\Omega, \mathcal{F}, P)$ (the probability space on which $\{B(t)\}$ is
defined) which are functions of $\{B(s), s \le t\}$. In this terminology $\{X(t)\}$ is a
non-anticipating function of $\{B(t)\}$ if $X(t) \in \mathcal{F}_t$ for all $t$.

To deal with the first integral in (D.2.5) we consider integrals of the form

$$\int_S^T m(u)\,du, \quad S \le T, \tag{D.2.6}$$

for functions $m$ on $\mathbb{R} \times \Omega$ belonging to the family $\mathcal{M}(S, T)$ defined by the properties
(i)-(iii) below. For clarity we have suppressed the dependence on $\omega \in \Omega$ in (D.2.5)
and (D.2.6), but in fact $X$ and $m$ are both functions on $\mathbb{R} \times \Omega$ with values $X(u, \omega)$ and
$m(u, \omega)$ respectively.

Defining properties of $m \in \mathcal{M}(S, T)$:


(i) $m(\cdot, \cdot)$ is a measurable function on $\mathbb{R} \times \Omega$.


(ii) $m(t, \cdot) \in \mathcal{F}_t$ for each $t \in \mathbb{R}$.


(iii) $P\Big(\int_S^T |m(u, \omega)|\,du < \infty\Big) = 1$.


For $m \in \mathcal{M}(S, T)$ the integrals $\int_S^t m(u)\,du$, $t \in [S, T]$, can be defined for all $\omega$
outside a set of probability zero as straightforward Lebesgue integrals, continuous
in $t$. Specifying them to be zero on the exceptional subset of $\Omega$ defines $\int_S^t m(u)\,du$,
$t \in [S, T]$, as a continuous function of $t$ for each $\omega$.

In order to attach a meaning to the second integral in (D.2.5) we need to define integrals of the form

$$\int_S^T f(u)\,dB(u), \qquad S \le T, \tag{D.2.7}$$

where the random variables $f(u)$, defined on the same probability space $(\Omega, \mathcal{F}, P)$ as $\{B(t)\}$, satisfy the properties (i)-(iii) specified below. We shall denote the class of such functions as $\mathcal{N}(S, T)$ and an integral of the form (D.2.7) as an Ito integral.

Defining properties of $f \in \mathcal{N}(S, T)$:

(i) $f(\cdot, \cdot)$ is a measurable function on $\mathbb{R} \times \Omega$.

(ii) $f(t, \cdot) \in \mathcal{F}_t$ for each $t \in \mathbb{R}$.

(iii) $E \int_S^T f(t, \omega)^2\,dt < \infty$.


The construction of the integral (D.2.7) is achieved by defining it for elementary functions and then extending the definition to all functions $f \in \mathcal{N}(S, T)$. The function $e$ is an elementary function if for some positive integer $n$,

$$e(u, \omega) = \sum_{j=-\infty}^{\infty} e_j(\omega)\, I_{(2^{-n}j,\, 2^{-n}(j+1)]}(u), \qquad u \in \mathbb{R},\ \omega \in \Omega, \tag{D.2.8}$$

where the random variables $e_j$ belong to $\mathcal{F}_{t_j}$ for all $j$ and the times $t_j$ are defined as in (D.2.2). Since the function $e(u, \omega)$ is independent of $u$ on the interval $(2^{-n}j, 2^{-n}(j+1)]$, and since $B$ increases on that interval by $\Delta B_j := B(t_{j+1}) - B(t_j)$, it is natural to define (suppressing $\omega$ as in (D.2.7)),

$$I_{S,T}(e) = \int_S^T e(u)\,dB(u) := \sum_{j=-\infty}^{\infty} e_j\,\Delta B_j, \qquad S \le T. \tag{D.2.9}$$


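Proposition D.2.1. If $e$ is bounded and elementary then

$$E\left[\left(\int_S^T e(u)\,dB(u)\right)^2\right] = E \int_S^T e(u)^2\,du, \qquad S \le T. \tag{D.2.10}$$

Proof. Observing that $E(e_i e_j \Delta B_i \Delta B_j) = \delta_{ij} E(e_j^2)\,\Delta t_j$, where $\delta_{ij} = 1$ if $i = j$ and 0 otherwise, we can rewrite the left-hand side of (D.2.10) as

$$\sum_{i=-\infty}^{\infty} \sum_{j=-\infty}^{\infty} E(e_i e_j \Delta B_i \Delta B_j) = \sum_{j=-\infty}^{\infty} E(e_j^2)\,\Delta t_j = E \sum_{j=-\infty}^{\infty} e_j^2\,\Delta t_j = E \int_S^T e(t)^2\,dt.$$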


Remark 1. The left-hand side of (D.2.10) is the squared norm of the random variable $I_{S,T}(e)$ defined on $(\Omega, \mathcal{F}, P)$. The right-hand side is the squared norm of the function

$$e^*(u, \omega) := \begin{cases} e(u, \omega), & \text{if } (u, \omega) \in [S, T] \times \Omega, \\ 0, & \text{otherwise,} \end{cases}$$

a square integrable function on the product space $[S, T] \times \Omega$ with respect to the product measure $\ell \times P$, where $\ell$ denotes Lebesgue measure. The mapping $e \mapsto I_{S,T}(e)$ thus determines an isometry from the restrictions $e^*$ of the bounded elementary functions $e$ to $[S, T] \times \Omega$ into the space of square integrable random variables on $(\Omega, \mathcal{F}, P)$.

It can be shown (see e.g., Oksendal 2013) that for every function $f \in \mathcal{N}(S, T)$ there is a sequence of bounded elementary functions $\{e_n\}$ such that

$$E \int_S^T (e_n(u) - f(u))^2\,du \to 0 \quad \text{as } n \to \infty. \tag{D.2.11}$$

This implies that $E \int_S^T (e_n(u) - e_m(u))^2\,du \to 0$ as $m$ and $n$ both go to $\infty$ and, by the isometry, that

$$E\left(I_{S,T}(e_n - e_m)\right)^2 = E\left(I_{S,T}(e_n) - I_{S,T}(e_m)\right)^2 \to 0.$$

By the Cauchy property of mean square convergence (Appendix C.1) it follows that $\{I_{S,T}(e_n)\}$ has a mean square limit.

If $\{g_n\}$ is another sequence of bounded elementary functions with the property (D.2.11) then $E \int_S^T (e_n(u) - g_n(u))^2\,du \to 0$ as $n \to \infty$ so that

$$E\left(I_{S,T}(e_n - g_n)\right)^2 = E\left(I_{S,T}(e_n) - I_{S,T}(g_n)\right)^2 \to 0.$$

Hence the mean square limit of $I_{S,T}(e_n)$ is the same for all sequences of bounded elementary functions satisfying (D.2.11) and the common limit is defined to be $I_{S,T}(f)$. Thus $I_{S,T}(f)$ can be defined unambiguously as

$$I_{S,T}(f) := \mathop{\mathrm{m.s.\,lim}}_{n \to \infty} I_{S,T}(e_n), \tag{D.2.12}$$


where $\{e_n\}$ is any sequence of bounded elementary functions satisfying (D.2.11). Moreover if $f \in \mathcal{N}(S, T)$ and $\{e_n\}$ satisfies (D.2.11), then

$$E\left(I_{S,T}(f)^2\right) = \lim_{n \to \infty} E\left(I_{S,T}(e_n)^2\right) = \lim_{n \to \infty} E \int_S^T e_n^2(u)\,du = E \int_S^T f^2(u)\,du,$$

showing that the isometry of the restrictions of bounded elementary functions extends to the corresponding restrictions $f^*$ of all functions in $\mathcal{N}(S, T)$.

This means that, in principle, $I_{S,T}(f)$ can be evaluated as the mean-square limit of $\int_S^T x_n(u)\,dB(u)$ where $\{x_n\}$ is any (not necessarily bounded) sequence of elementary functions such that $E \int_S^T (x_n(u) - f(u))^2\,du \to 0$ as $n \to \infty$. In particular it can be shown in this way that

$$\int_S^T B(u)\,dB(u) = \tfrac{1}{2}\left(B^2(T) - B^2(S)\right) - \tfrac{1}{2}(T - S).$$

We shall not go into the details as we shall derive this result in a much simpler way using the tools of Ito calculus to be discussed in the following section. □
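The limiting procedure can be imitated numerically by replacing $B$ with the elementary function that equals $B(t_j)$ on $(t_j, t_{j+1}]$ and summing $B(t_j)\,\Delta B_j$. The Python sketch below is an illustration under stated assumptions, not part of the text's construction: it uses an equally spaced grid rather than the dyadic times of (D.2.2), takes $B(0) = 0$, and the function name `ito_sum_b_db` is hypothetical.

```python
import math
import random


def ito_sum_b_db(s, t_end, n_steps, rng):
    """Return (left-endpoint sum of B dB, closed-form value) for one simulated path on [s, t_end]."""
    dt = (t_end - s) / n_steps
    # B(s) ~ N(0, s) when B(0) = 0 (an assumption of this sketch).
    b_start = rng.gauss(0.0, math.sqrt(s)) if s > 0 else 0.0
    b = b_start
    total = 0.0
    for _ in range(n_steps):
        d_b = rng.gauss(0.0, math.sqrt(dt))
        total += b * d_b  # integrand evaluated at the left endpoint (non-anticipating)
        b += d_b
    closed_form = 0.5 * (b * b - b_start * b_start) - 0.5 * (t_end - s)
    return total, closed_form


if __name__ == "__main__":
    approx, exact = ito_sum_b_db(0.0, 1.0, 2 ** 14, random.Random(42))
    print(approx, exact)
```

The two returned values differ by exactly $\tfrac12\left(\sum_j (\Delta B_j)^2 - (T - S)\right)$, so the gap shrinks as the grid is refined because the sum of squared increments concentrates around $T - S$.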

Remark 2. If $f \in \mathcal{N}(S, T)$ then for each $t \in [S, T]$ so also is the function $\{f(u, \omega) I_{[S,t]}(u),\ \omega \in \Omega,\ u \in \mathbb{R}\}$, where $I_{[S,t]}$ is the indicator function of the set $[S, t]$. This enables us to define

$$\int_S^t f(u)\,dB(u) := \int_S^T f(u)\, I_{[S,t]}(u)\,dB(u)$$

for each $t \in [S, T]$ and each $f \in \mathcal{N}(S, T)$.

□


Remark 3. If $f \in \mathcal{N} := \cap\, \mathcal{N}(S, T)$, where $\cap$ denotes the intersection over all $S \in \mathbb{R}$ and $T \in \mathbb{R}$ such that $S \le T$, then $I_{s,t}(f)$ is defined for all real-valued $s$ and $t$ such that $s \le t$ and the integral has the properties,

(i) $E\, I_{s,t}(f) = 0$.

(ii) $I_{s,u}(f) = I_{s,t}(f) + I_{t,u}(f)$, $s \le t \le u$.

(iii) $I_{s,t}(af + bg) = a I_{s,t}(f) + b I_{s,t}(g)$ for all $a, b \in \mathbb{R}$ and $g \in \mathcal{N}$.

(iv) $E\left[I_{s,t}(f) I_{s,t}(g)\right] = E \int_s^t f(u) g(u)\,du$ for all $g \in \mathcal{N}$.

(v) For each fixed $s \in \mathbb{R}$, $\{I_{s,t}(f), t \ge s\}$ is an $\mathcal{F}_t$-martingale, i.e. $E|I_{s,t}(f)| < \infty$ and

$$E\left(I_{s,u}(f) \mid B(y), y \le t\right) = I_{s,t}(f), \qquad u \ge t \ge s.$$

(vi) For each fixed $s \in \mathbb{R}$ and for each fixed $T > s$ there is a version of $\{I_{s,t}(f), s \le t \le T\}$ which is continuous in $t$. In other words there is a process $\{X_t, s \le t \le T\}$ with continuous sample-paths such that

$$P\left(X_t = \int_s^t f(u)\,dB(u)\right) = 1 \quad \text{for all } t \in [s, T].$$

Properties (i)-(iv) are clearly true for bounded elementary functions $f$ and $g$. Their validity for functions in $\mathcal{N}$ can be established by taking limits. Property (v) follows from (ii) and the independence of the increments of $\{B(t)\}$. The proof of property (vi) is beyond the scope of this book [see, e.g., Oksendal (2013) for details]. □
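Properties (i) and (iv) can be checked roughly by simulation. With $f = g = B$ on $[0, 1]$ and $B(0) = 0$, property (iv) gives $E[I_{0,1}(B)^2] = E\int_0^1 B(u)^2\,du = \int_0^1 u\,du = \tfrac12$. The following Python sketch is only a Monte Carlo illustration: the integrand, grid, sample sizes and function names are assumptions chosen for the example.

```python
import math
import random


def simulate_ito_integral_of_b(n_steps, rng):
    """One draw of the left-endpoint approximation to I_{0,1}(B), the integral of B dB over [0, 1]."""
    dt = 1.0 / n_steps
    b = 0.0
    total = 0.0
    for _ in range(n_steps):
        d_b = rng.gauss(0.0, math.sqrt(dt))
        total += b * d_b
        b += d_b
    return total


def moment_check(n_paths, n_steps, seed=0):
    """Monte Carlo estimates of E[I] and E[I^2] for I = I_{0,1}(B)."""
    rng = random.Random(seed)
    draws = [simulate_ito_integral_of_b(n_steps, rng) for _ in range(n_paths)]
    mean = sum(draws) / n_paths
    second_moment = sum(d * d for d in draws) / n_paths
    return mean, second_moment


if __name__ == "__main__":
    print(moment_check(3000, 100, seed=1))
```

The first estimate should be near 0 and the second near 0.5, up to Monte Carlo error and a small discretization bias of order $1/n$ in the number of grid steps.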


D.3  Ito  Processes  and  Ito's  Formula 

Direct evaluation of Ito stochastic integrals from the definition (D.2.12) is very messy. For example, it can be shown by a lengthy calculation from the definition that

$$\int_0^t B(u)\,dB(u) = \tfrac{1}{2} B(t)^2 - \tfrac{1}{2} t.$$

Ito’s formula provides a chain rule for evaluating such integrals. It is clear from this example that the classic rule for Riemann integration does not apply. If, for example, we apply it in this particular case we find, from the rule $d(x^2) = 2x\,dx$, that the integral is $\tfrac{1}{2} B(t)^2$ instead of the correct expression above. Before we can derive the appropriate rule however we first need to define what is meant by an Ito process.

Ito  Process 

This is a process which satisfies (suppressing the argument $\omega$ as before)

$$X(t) = X(s) + \int_s^t m(u)\,du + \int_s^t f(u)\,dB(u), \qquad s \le t \in \mathbb{R}, \tag{D.3.1}$$

where

$$X(t) \in \mathcal{F}_t \ \text{ for all } t \in \mathbb{R}, \tag{D.3.2}$$

$$m \in \mathcal{M}(S, T) \ \text{ for all } S \le T \in \mathbb{R}, \tag{D.3.3}$$

and

$$f \in \mathcal{N}^*(S, T) \ \text{ for all } S \le T \in \mathbb{R}, \tag{D.3.4}$$

with $\mathcal{M}(S, T)$ defined as in Section D.2 and $\mathcal{N}^*(S, T)$ defined like $\mathcal{N}(S, T)$ in Section D.2 except for the replacement of property (iii), $E\left(\int_S^T f(u)^2\,du\right) < \infty$, by the weaker condition,

(iii)* $P\left(\int_S^T f(u)^2\,du < \infty\right) = 1$.

It can be shown that, under this weaker condition, the integrals $I_{s,t}(f)$, $s \le t \in \mathbb{R}$, can still be defined, retaining all of the properties in Remark 3 of Section D.2 with the exception of the martingale property (v).
Definition (D.3.1) is often written in the shorthand notation,

$$dX(t) = m(t)\,dt + f(t)\,dB(t). \tag{D.3.5}$$

Both of the integrals in (D.3.1) are assumed to be continuous versions so that the Ito process $\{X(t)\}$ is also continuous. The first integral is usually referred to as the drift component of $\{X(t)\}$ and the second as the Brownian component.

Ito’s Formula

Ito’s formula is concerned with smooth functions of Ito processes. Specifically it states that if $\{X(t)\}$ is an Ito process satisfying (D.3.5) and $g(t, x)$ is a function on $\mathbb{R} \times \mathbb{R}$ with continuous partial derivatives $\partial g/\partial t$ and $\partial^2 g/\partial x^2$ then

(i) $Y(t) := g(t, X(t))$ is an Ito process and

(ii)

$$dY(t) = \frac{\partial g}{\partial t}(t, X(t))\,dt + \frac{\partial g}{\partial x}(t, X(t))\,dX(t) + \frac{1}{2} \frac{\partial^2 g}{\partial x^2}(t, X(t))\,(dX(t))^2, \tag{D.3.6}$$

where $dX(t) = m\,dt + f\,dB(t)$ and $(dX(t))^2 = f^2\,dt$.

Writing $g_t$, $g_x$ and $g_{xx}$ for the corresponding partial derivatives of $g$ evaluated at $(t, X(t))$, and substituting for $dX(t)$ and $(dX(t))^2$ as indicated in (ii), we can write the increment of $Y(t)$ explicitly in the form (D.3.5) as

$$dY(t) = \left(g_t + m g_x + \tfrac{1}{2} f^2 g_{xx}\right) dt + f g_x\,dB(t). \tag{D.3.7}$$

Example D.3.1. $\int_0^t B(u)\,dB(u)$

Inspection of (D.3.7) suggests that in order to find a process with increments $B(u)\,dB(u)$ we should start with the Ito process $X(t) = B(t)$, for which $m = 0$ and $f = 1$, and define $Y(t) = g(t, X(t))$ where $g_x(t, x) = x$. Taking $g(t, x) = x^2/2$ we obtain, from (D.3.7),

$$dY(t) = \tfrac{1}{2}\,dt + B(t)\,dB(t),$$

which gives

$$\int_0^t B(u)\,dB(u) = Y(t) - Y(0) - \tfrac{1}{2} t = \tfrac{1}{2} B(t)^2 - \tfrac{1}{2} t.$$

□
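As a further illustration of (D.3.7), not taken from the text, the same recipe can be applied with a cubic in place of the quadratic. Taking $X(t) = B(t)$ (so that $m = 0$ and $f = 1$) and the illustrative choice $g(t, x) = x^3$ gives $g_t = 0$, $g_x = 3x^2$ and $g_{xx} = 6x$, so that, assuming $B(0) = 0$,

```latex
\begin{aligned}
d\bigl(B(t)^3\bigr) &= \tfrac{1}{2}\cdot 6B(t)\,dt + 3B(t)^2\,dB(t) = 3B(t)\,dt + 3B(t)^2\,dB(t), \\
\int_0^t B(u)^2\,dB(u) &= \tfrac{1}{3}B(t)^3 - \int_0^t B(u)\,du .
\end{aligned}
```

Here again the second-derivative term contributes a correction, this time the ordinary integral $\int_0^t B(u)\,du$, that the classical chain rule would miss.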

Multivariate  Ito  Processes 

An $n$-dimensional Ito process $\{\mathbf{X}(t)\}$ is defined to be an $n$-dimensional vector-valued process satisfying an equation (cf. (D.3.5)),

$$d\mathbf{X}(t) = \mathbf{m}(t)\,dt + F(t)\,d\mathbf{B}(t), \tag{D.3.8}$$

where $\{\mathbf{B}(t)\}$ is $m$-dimensional standard Brownian motion, i.e. an $m$-dimensional random process with components which are independent one-dimensional standard Brownian motions, the components of the $n$-vectors $\mathbf{X}(t)$ and $\mathbf{m}(t)$ satisfy (D.3.2) and (D.3.3) respectively, and each component $f_{ij}$ of the $n \times m$ matrix $F(t)$ satisfies (D.3.4). The more explicit form of (D.3.8), corresponding to (D.3.1), is

$$\mathbf{X}(t) = \mathbf{X}(s) + \int_s^t \mathbf{m}(u)\,du + \int_s^t F(u)\,d\mathbf{B}(u), \qquad s \le t \in \mathbb{R}. \tag{D.3.9}$$
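The Multidimensional Ito Formula

The multidimensional version of Ito’s formula states that if $\{\mathbf{X}(t)\}$ is an $n$-dimensional Ito process satisfying (D.3.9) and $\mathbf{g}(t, \mathbf{x})$ is a function on $\mathbb{R} \times \mathbb{R}^n$ with values in $\mathbb{R}^p$ and with continuous second partial derivatives, then

(i) $\{\mathbf{Y}(t) := \mathbf{g}(t, \mathbf{X}(t))\}$ is a $p$-dimensional Ito process and

(ii)

$$dY_i(t) = \frac{\partial g_i}{\partial t}\,dt + \sum_{j=1}^{n} \frac{\partial g_i}{\partial x_j}\,dX_j(t) + \frac{1}{2} \sum_{j=1}^{n} \sum_{k=1}^{n} \frac{\partial^2 g_i}{\partial x_j\,\partial x_k}\,dX_j(t)\,dX_k(t), \tag{D.3.10}$$

where $X_i$, $Y_i$ and $g_i$ are the components of $\mathbf{X}$, $\mathbf{Y}$ and $\mathbf{g}$ respectively, and the partial derivatives of $\mathbf{g}$ are all evaluated at $(t, \mathbf{X}(t))$. The increments $dX_j$ satisfy the relations $dX_j(t) = m_j\,dt + \sum_{r=1}^{m} f_{jr}\,dB_r(t)$ and $dX_j(t)\,dX_k(t) = \sum_{r=1}^{m} f_{jr} f_{kr}\,dt$, where $m_j$ and $f_{jr}$ are the components of $\mathbf{m}(t)$ and $F(t)$ respectively.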


In the following section we shall consider solutions of stochastic differential equations of the form

$$d\mathbf{X}(t) = \mathbf{b}(t, \mathbf{X}(t))\,dt + \sigma(t, \mathbf{X}(t))\,d\mathbf{B}(t), \qquad S \le t \le T, \quad \mathbf{X}(S) = \mathbf{Z}, \tag{D.3.11}$$

where $\{\mathbf{B}(t)\}$ is $m$-dimensional standard Brownian motion. Conditions on the functions $\mathbf{b}$ and $\sigma$ and the initial random variable $\mathbf{Z}$ which guarantee existence and uniqueness of solutions will be specified in Theorem D.4.1, a proof of which can be found in Oksendal (2013).


D.4  Ito  Stochastic  Differential  Equations 

The equation (D.3.11) is known as an Ito stochastic differential equation for the $\mathbb{R}^n$-valued random process $\{\mathbf{X}(t)\}$. Equations (7.5.6), for geometric Brownian motion, (11.5.2), for the CAR(1) process, and (11.5.9), for the state vector of a CARMA process, are special cases. It is trivial to check, in each of these cases, that the conditions on $\mathbf{b}$ and $\sigma$ given in the following theorem are satisfied for all $S$ and $T \in \mathbb{R}$ with $T > S$.
Provided  the  conditions  on  the  initial  random  vector  Z  are  satisfied,  these  guarantee 
the  existence  and  uniqueness  of  a  continuous  solution  of  (D.3.11).  After  stating  the 
theorem  we  shall  use  Ito’s  formula  to  derive  solutions  of  the  particular  Ito  equations 
(7.5.6)  and  (11.5.9).  The  solution  of  (11.5.2)  was  discussed  in  Section  11.5.1. 

Theorem D.4.1. Suppose that $S < T \in \mathbb{R}$ and that the measurable functions $\mathbf{b} : [S, T] \times \mathbb{R}^n \to \mathbb{R}^n$ and $\sigma : [S, T] \times \mathbb{R}^n \to \mathbb{R}^{n \times m}$ in (D.3.11) have the properties

$$|\mathbf{b}(t, \mathbf{x})| + |\sigma(t, \mathbf{x})| \le C(1 + |\mathbf{x}|), \qquad \mathbf{x} \in \mathbb{R}^n,\ t \in [S, T],$$

and

$$|\mathbf{b}(t, \mathbf{x}) - \mathbf{b}(t, \mathbf{y})| + |\sigma(t, \mathbf{x}) - \sigma(t, \mathbf{y})| \le D|\mathbf{x} - \mathbf{y}|,$$

where $C$ and $D$ are finite positive constants and $|M|$ denotes the (positive) square root of the sum of squares of the components of the matrix or vector $M$. If $\mathbf{Z}$ is a random variable independent of $\{\mathbf{B}(t) - \mathbf{B}(s),\ S \le s \le t \le T\}$ such that $E|\mathbf{Z}|^2 < \infty$, then the stochastic differential equation (D.3.11) has a unique continuous (in $t$) solution, each component of which belongs to $\mathcal{N}^*(S, T)$ as defined in (D.3.4).

Geometric  Brownian  Motion 

Geometric  Brownian  motion  was  introduced  in  Section  7.5.2  as  a  continuous-time 
model  for  asset  prices  and  was  the  basis  for  the  derivations  by  Black  and  Scholes 
(1973)  and  Merton  (1973)  of  the  option-pricing  formula  discussed  in  Section  7.6.  Here 
we shall use Ito’s formula to find the solution $\{P(t), t \ge 0\}$ of the defining differential equation,

$$dP(t) = P(t)\left[\mu\,dt + \sigma\,dB(t)\right], \qquad t \ge 0, \tag{D.4.1}$$

where $P(0)$ is a strictly positive random variable, independent of $\{B(t) - B(s),\ 0 \le s \le t < \infty\}$. The standard calculus identity, $d(\log(y)) = dy/y$, suggests that we try applying Ito’s formula with $X(t) = P(t)$ and $g(t, x) = \log(x)$. The function $g$ has continuous partial derivatives, $\partial g/\partial t = 0$, $\partial g/\partial x = 1/x$ and $\partial^2 g/\partial x^2 = -1/x^2$ on the set where $x > 0$. Substituting in (D.3.6) and using (D.4.1) we obtain

$$d(\log P(t)) = \frac{1}{P(t)}\,dP(t) - \frac{1}{2P(t)^2}\,(dP(t))^2 = \left(\mu - \frac{\sigma^2}{2}\right) dt + \sigma\,dB(t), \tag{D.4.2}$$

whence

$$\log(P(t)) - \log(P(0)) = \left(\mu - \frac{\sigma^2}{2}\right) t + \sigma B(t).$$

This is equivalent to the solution (7.5.7) given earlier.
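The closed-form solution can be compared with a direct discretization of (D.4.1) along the same simulated Brownian path. The Python sketch below is illustrative only; the parameter values, the equally spaced grid and the function names are assumptions, and $B(0) = 0$ is taken for granted.

```python
import math
import random


def gbm_exact(p0, mu, sigma, t, b_t):
    """Closed-form value P(t) = P(0) * exp((mu - sigma^2/2) * t + sigma * B(t))."""
    return p0 * math.exp((mu - 0.5 * sigma * sigma) * t + sigma * b_t)


def gbm_euler(p0, mu, sigma, increments, dt):
    """Euler recursion P_{j+1} = P_j * (1 + mu*dt + sigma*dB_j) for dP = P (mu dt + sigma dB)."""
    p = p0
    for d_b in increments:
        p *= 1.0 + mu * dt + sigma * d_b
    return p


def compare(p0=1.0, mu=0.05, sigma=0.2, t=1.0, n_steps=4096, seed=0):
    """Evaluate both versions on one simulated path of Brownian increments."""
    rng = random.Random(seed)
    dt = t / n_steps
    increments = [rng.gauss(0.0, math.sqrt(dt)) for _ in range(n_steps)]
    b_t = sum(increments)  # B(t) on the same path
    return gbm_exact(p0, mu, sigma, t, b_t), gbm_euler(p0, mu, sigma, increments, dt)


if __name__ == "__main__":
    print(compare())
```

Dropping the $-\sigma^2/2$ term in `gbm_exact` makes the two values drift apart as the grid is refined, which is a quick way to see that the Ito correction in (D.4.2) is needed.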


Gaussian  CARMA  Processes 

The state equation (11.5.9) for the Gaussian CARMA($p$, $q$) process, i.e.

$$d\mathbf{X}(t) = A\mathbf{X}(t)\,dt + \mathbf{e}\,dB(t), \tag{D.4.3}$$

where $\mathbf{X}(0)$ is independent of $\{B(t) - B(s),\ 0 \le s \le t \le T\}$ and $E|\mathbf{X}(0)|^2 < \infty$, clearly satisfies the conditions of Theorem D.4.1 and therefore has a unique solution which is continuous in $t$. In order to find the solution we multiply both sides by the integrating factor $e^{-At}$, as we would if $\{B(t)\}$ were deterministic. Since $e^{-At}$ is non-singular the state equation is equivalent to the equation

$$e^{-At}\,d\mathbf{X}(t) - e^{-At} A\mathbf{X}(t)\,dt = e^{-At}\mathbf{e}\,dB(t). \tag{D.4.4}$$

This form of the equation suggests applying the multivariate Ito formula with $\mathbf{g}(t, \mathbf{x}) = e^{-At}\mathbf{x}$. The second derivatives of $\mathbf{g}$ are all continuous and satisfy

$$\frac{\partial^2 g_i}{\partial x_j\,\partial x_k} = 0 \quad \text{for all } i, j \text{ and } k,$$

and

$$\frac{\partial \mathbf{g}}{\partial t} = -A e^{-At}\mathbf{x}, \qquad \frac{\partial \mathbf{g}}{\partial x_r} = e^{-At}\mathbf{e}_r,$$

where $\mathbf{e}_r$, $r \in \{1, \ldots, p\}$, denotes a $p$-component column vector, all of whose components are zero except for the $r$th, which is one. Substituting these derivatives into (D.3.10) and writing the resulting equations in vector form we obtain

$$d\left(e^{-At}\mathbf{X}(t)\right) = -A e^{-At}\mathbf{X}(t)\,dt + e^{-At}\,d\mathbf{X}(t).$$

Substituting this expression in (D.4.4) gives

$$d\left(e^{-At}\mathbf{X}(t)\right) = e^{-At}\mathbf{e}\,dB(t),$$

which implies that

$$e^{-At}\mathbf{X}(t) - \mathbf{X}(0) = \int_0^t e^{-Au}\mathbf{e}\,dB(u),$$

or equivalently

$$\mathbf{X}(t) = e^{At}\mathbf{X}(0) + \int_0^t e^{A(t-u)}\mathbf{e}\,dB(u), \qquad 0 \le t \le T. \tag{D.4.5}$$

Since equation (D.4.3), with $\mathbf{X}(S)$ independent of $\{B(t) - B(s),\ S \le s \le t \le T\}$ and $E|\mathbf{X}(S)|^2 < \infty$, satisfies the conditions of Theorem D.4.1 for all $S \in \mathbb{R}$ and $T \in \mathbb{R}$ such that $S \le T$, exactly the same arguments give the more general relation,

$$\mathbf{X}(t) = e^{A(t-S)}\mathbf{X}(S) + \int_S^t e^{A(t-u)}\mathbf{e}\,dB(u), \qquad t \ge S, \text{ for all } S \in \mathbb{R}. \tag{D.4.6}$$

This is equation (11.5.11) for which we showed (in Section 11.5.2) that the unique causal stationary solution is

$$\mathbf{X}(t) = \int_{-\infty}^t e^{A(t-u)}\mathbf{e}\,dB(u), \qquad t \in \mathbb{R}.$$

This led, with (11.5.8), to the definition of the zero-mean causal CARMA($p$, $q$) process $\{Y(t), t \in \mathbb{R}\}$ as

$$Y(t) = \int_{-\infty}^t \mathbf{b}' e^{A(t-u)}\mathbf{e}\,dB(u)$$

and, more generally in Section 11.5.3, to the second-order Levy-driven CARMA($p$, $q$) process,

$$Y(t) = \int_{(-\infty, t]} \mathbf{b}' e^{A(t-u)}\mathbf{e}\,dL(u).$$
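In the scalar case the relation (D.4.6) yields an exact simulation recipe. Assuming for illustration that $A = -a$ with $a > 0$ and $\mathbf{e} = 1$ (a special case chosen here, not a definition from the text), (D.4.6) with $S = t$ reads $X(t+h) = e^{-ah}X(t) + \int_t^{t+h} e^{-a(t+h-u)}\,dB(u)$, and by the isometry of Section D.2 the stochastic integral is Gaussian with mean zero and variance $\int_0^h e^{-2av}\,dv = (1 - e^{-2ah})/(2a)$. The Python sketch below uses hypothetical function names.

```python
import math
import random


def ou_transition(a, h):
    """Coefficient and noise variance of X(t+h) = phi * X(t) + noise for A = -a, e = 1."""
    phi = math.exp(-a * h)
    noise_var = (1.0 - math.exp(-2.0 * a * h)) / (2.0 * a)  # integral of exp(-2 a v) over [0, h]
    return phi, noise_var


def simulate_ou(a, h, n, seed=0):
    """Sample X(0), X(h), ..., X((n-1)h) from the stationary solution (requires a > 0)."""
    rng = random.Random(seed)
    phi, noise_var = ou_transition(a, h)
    x = rng.gauss(0.0, math.sqrt(1.0 / (2.0 * a)))  # stationary start, variance 1/(2a)
    path = [x]
    for _ in range(n - 1):
        x = phi * x + rng.gauss(0.0, math.sqrt(noise_var))
        path.append(x)
    return path


if __name__ == "__main__":
    path = simulate_ou(1.0, 0.5, 20000, seed=3)
    print(sum(v * v for v in path) / len(path))  # compare with 1/(2a) = 0.5
```

Because the transition is exact, the sampled values form a discrete-time AR(1) sequence with coefficient $e^{-ah}$ whatever the spacing $h$; no discretization error is introduced.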
Appendix E An ITSM Tutorial


E.1  Getting  Started 

E.1.1  Running  ITSM 

Double-clicking  on  the  ITSM  or  the  ITSM-Shortcut  icon  will  open  the  ITSM  window. 
To analyze one of the data sets provided, select File>Project>Open at the top left corner of the ITSM window.

There  are  several  distinct  functions  of  the  program  ITSM.  The  first  is  to  analyze 
and  display  the  properties  of  time  series  data,  the  second  is  to  compute  and  display  the 
properties  of  time  series  models,  and  the  third  is  to  combine  these  functions  in  order  to 
fit  models  to  data.  The  last  of  these  includes  checking  that  the  properties  of  the  fitted 
model  match  those  of  the  data  in  a  suitable  sense.  Having  found  an  appropriate  model, 
we  can  (for  example)  then  use  it  in  conjunction  with  the  data  to  forecast  future  values 
of  the  series.  Sections  E.2-E.5  of  this  appendix  deal  with  the  modeling  and  analysis  of 
data,  while  Section  E.6  is  concerned  with  model  properties.  Section  E.7  explains  how 
to  open  multivariate  projects  in  ITSM.  Examples  of  the  analysis  of  multivariate  time 
series  are  given  in  Chapter  8. 

It  is  important  to  keep  in  mind  the  distinction  between  data  and  model  properties 
and  not  to  confuse  the  data  with  the  model.  In  any  one  project  ITSM  stores  one  data 
set  and  one  model  (which  can  be  identified  by  highlighting  the  project  window  and 
pressing  the  red  INFO  button  at  the  top  of  the  ITSM  window).  Until  a  model  is  entered 
by the user, ITSM stores the default model of white noise with variance 1. If the data are
transformed  (e.g.,  differenced  and  mean-corrected),  then  the  data  are  replaced  in  ITSM 
by  the  transformed  data.  (The  original  data  can,  however,  be  restored  by  inverting  the 
transformations.)  Rarely  (if  ever)  is  a  real  time  series  generated  by  a  model  as  simple 
as  those  used  for  fitting  purposes.  In  model  fitting  the  objective  is  to  develop  a  model 
that  mimics  important  features  of  the  data,  but  is  still  simple  enough  to  be  used  with 
relative  ease. 

The  following  sections  constitute  a  tutorial  that  illustrates  the  use  of  some  of  the 
features  of  ITSM  by  leading  you  through  a  complete  analysis  of  the  well-known  airline 
passenger  series  of  Box  and  Jenkins  (1976)  filed  as  AIRPASS.TSM  in  the  ITSM2000 
folder. 


E.2  Preparing  Your  Data  for  Modeling 

The  observed  values  of  your  time  series  should  be  available  in  a  single-column  ASCII 
file  (or  two  columns  for  a  bivariate  series).  The  file,  like  those  provided  with  the 
package,  should  be  given  a  name  with  suffix  .TSM.  You  can  then  begin  model 
fitting  with  ITSM.  The  program  will  read  your  data  from  the  file,  plot  it  on  the 
screen,  compute  sample  statistics,  and  allow  you  to  make  a  number  of  transformations 
designed  to  make  your  transformed  data  representable  as  a  realization  of  a  zero-mean 
stationary  process. 
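A file in this single-column format is also easy to inspect outside ITSM. The following Python sketch is not part of ITSM: the function names are hypothetical, only the first column of each line is read, and the sample variance uses divisor $n$, which is an assumption about the convention rather than something stated here.

```python
def read_series(path):
    """Read one numeric value per non-blank line (first column only) from a plain-text file."""
    with open(path) as handle:
        return [float(line.split()[0]) for line in handle if line.strip()]


def sample_mean_and_variance(values):
    """Sample mean and sample variance with divisor n (assumed convention)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return mean, variance


if __name__ == "__main__":
    series = read_series("AIRPASS.TSM")  # assumes the file is in the working directory
    print(len(series), sample_mean_and_variance(series))
```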

Example E.2.1. To illustrate the analysis we shall use the file AIRPASS.TSM, which contains the number of international airline passengers (in thousands) for each month from January, 1949, through December, 1960.

□ 
E.2.1 Entering Data

Once  you  have  opened  the  ITSM  window  as  described  above  under  Getting  Started, 
select  the  options  File>Project>Open,  and  you  will  see  a  dialog  box  in  which  you 
can  check  either  Univariate  or  Multivariate.  Since  the  data  set  for  this 
example  is  univariate,  make  sure  that  the  univariate  option  is  checked  and  then  click 
OK.  A  window  labeled  Open  File  will  then  appear,  in  which  you  can  either  type 
the  name  AIRPASS.TSM  and  click  Open,  or  else  locate  the  icon  for  AIRPASS.TSM 
in  the  Open  File  window  and  double-click  on  it.  You  will  then  see  a  graph  of  the 
monthly international airline passenger totals (measured in thousands) $X_1, \ldots, X_n$,
with  n  =  144.  Directly  behind  the  graph  is  a  window  containing  data  summary 
statistics. 

An  additional,  second,  project  can  be  opened  by  repeating  the  procedure  described 
in the preceding paragraph. Alternatively, the data can be replaced in the current project using the option File>Import File. This option is useful if you wish to examine how well a fitted model represents a different data set. (See the entry Project Editor in the ITSM_HELP Files for information on multiple project
management.  Each  ITSM  project  has  its  own  data  set  and  model.)  For  the  purpose  of 
this  introduction  we  shall  open  only  one  project. 


E.2.2 Information

If,  with  the  window  labeled  AIRPASS.TSM  highlighted,  you  press  the  red  INFO 
button  at  the  top  of  the  ITSM  window,  you  will  see  the  sample  mean,  sample  variance, 
estimated  standard  deviation  of  the  sample  mean,  and  the  current  model  (white  noise 
with  variance  1). 

Example E.2.2. Go through the steps in Entering Data to open the project AIRPASS.TSM and use the INFO button to determine the sample mean and variance of the series.

□ 


E.2.3  Filing  Data 

You  may  wish  to  transform  your  data  using  ITSM  and  then  store  it  in  another  file.  At 
any  time  before  or  after  transforming  the  data  in  ITSM,  the  data  can  be  exported  to 
a  file  by  clicking  on  the  red  Export  button,  selecting  Time  Series  and  File, 
clicking  OK,  and  specifying  a  new  file  name.  The  numerical  values  of  the  series  can 
also  be  pasted  to  the  clipboard  (and  from  there  into  another  document)  in  the  same  way 
by choosing Clipboard instead of File. Other quantities computed by the program
(e.g.,  the  residuals  from  the  current  model)  can  be  filed  or  pasted  to  the  clipboard  in 
the  same  way  by  making  the  appropriate  selection  in  the  Export  dialog  box.  Graphs 
can  also  be  pasted  to  the  clipboard  by  right-clicking  on  them  and  selecting  Copy  to 
Clipboard. 

Example E.2.3. Copy the series AIRPASS.TSM to the clipboard, open Wordpad or some convenient screen editor, and choose Edit>Paste to insert the series into your new document. Then copy the graph of the series to the clipboard and insert it into your document in the same way.

□ 


E.2.4 Plotting Data

A time series graph is automatically plotted when you open a data file (with time measured in units of the interval between observations, i.e., $t = 1, 2, 3, \ldots$). To see a histogram of the data press the rightmost yellow button at the top of the ITSM screen. If you wish to adjust the number of bins in the histogram, select Statistics>Histogram>Set Bin Count and specify the number of bins required. The histogram will then be replotted accordingly.

To insert any of the ITSM graphs into a text document, right-click on the graph concerned, select Copy to Clipboard, and the graph will be copied to the clipboard. It can then be pasted into a document opened by any standard text editor such as MS-Word or Wordpad using the Edit>Paste option in the screen editor. The graph can also be sent directly to a printer by right-clicking on the graph and selecting Print. Another useful graphics feature is provided by the white Zoom buttons at the top of the ITSM screen. The first and second of these enable you to enlarge a designated segment or box, respectively, of any of the graphs. The third button restores the original graph.

Example E.2.4. Continuing with our analysis of AIRPASS.TSM, press the yellow histogram button to see a histogram of the data. Replot the histogram with 20 bins by selecting Statistics>Histogram>Set Bin Count.

□


E.2.5 Transforming Data

Transformations are applied in order to produce data that can be successfully modeled as “stationary time series.” In particular, they are used to eliminate trend and cyclic components and to achieve approximate constancy of level and variability with time.

Example E.2.5. The airline passenger data (see Figure 10-4) are clearly not stationary. The level and variability both increase with time, and there appears to be a large seasonal component (with period 12). They must therefore be transformed in order to be represented as a realization of a stationary time series using one or more of the transformations available for this purpose in ITSM.

□


Box-Cox  Transformations 

Box-Cox transformations are performed by selecting Transform>Box-Cox and specifying the value of the Box-Cox parameter $\lambda$. If the original observations are $Y_1, Y_2, \ldots, Y_n$, the Box-Cox transformation $f_\lambda$ converts them to $f_\lambda(Y_1), f_\lambda(Y_2), \ldots, f_\lambda(Y_n)$, where

$$f_\lambda(y) = \begin{cases} \dfrac{y^\lambda - 1}{\lambda}, & \lambda \ne 0, \\[1ex] \log(y), & \lambda = 0. \end{cases}$$

These transformations are useful when the variability of the data increases or decreases with the level. By suitable choice of $\lambda$, the variability can often be made nearly constant. In particular, for positive data whose standard deviation increases linearly with level, the variability can be stabilized by choosing $\lambda = 0$.
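Outside ITSM the transformation amounts to a few lines of code. The Python sketch below is a minimal illustration of the standard Box-Cox formula for strictly positive data; the function name `box_cox` is hypothetical and is not an ITSM command.

```python
import math


def box_cox(values, lam):
    """Apply (y**lam - 1) / lam for lam != 0 and log(y) for lam == 0 to each value."""
    if any(y <= 0 for y in values):
        raise ValueError("this sketch assumes strictly positive data")
    if lam == 0:
        return [math.log(y) for y in values]
    return [(y ** lam - 1.0) / lam for y in values]


if __name__ == "__main__":
    data = [112.0, 118.0, 132.0, 129.0]  # made-up positive values for illustration
    print(box_cox(data, 0))
    print(box_cox(data, 0.5))
```

As $\lambda \to 0$ the expression $(y^\lambda - 1)/\lambda$ tends to $\log(y)$, so the family is continuous in $\lambda$ and the logarithm is its natural member at zero.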

The choice of $\lambda$ can be made visually by watching the graph of the data when you click on the pointer in the Box-Cox dialog box and drag it back and forth along the scale, which runs from zero to 1.5. Very often it is found that no transformation is needed or that the choice $\lambda = 0$ is satisfactory.

Example E.2.6. For the series AIRPASS.TSM, the variability increases with level, and the data are strictly positive. Taking natural logarithms (i.e., choosing a Box-Cox transformation with $\lambda = 0$) gives the transformed data shown in Figure E-1.

Figure E-1. The series AIRPASS.TSM after taking logs

Notice how the amplitude of the fluctuations no longer increases with the level of the data. However, the seasonal effect remains, as does the upward trend. These will be removed shortly. The data stored in ITSM now consist of the natural logarithms of the original data.

□


Classical  Decomposition 

There  are  two  methods  provided  in  ITSM  for  the  elimination  of  trend  and  seasonality. 
These  are: 

i. “classical decomposition” of the series into a trend component, a seasonal component, and a random residual component, and

ii. differencing.

Classical decomposition of the series $\{X_t\}$ is based on the model

$$X_t = m_t + s_t + Y_t,$$

where $X_t$ is the observation at time $t$, $m_t$ is a “trend component,” $s_t$ is a “seasonal component,” and $Y_t$ is a “random noise component,” which is stationary with mean zero. The objective is to estimate the components $m_t$ and $s_t$ and subtract them from the data to generate a sequence of residuals (or estimated noise) that can then be modeled as a stationary time series.
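To make the idea concrete, here is a simplified Python sketch of one way to carry out such a decomposition. It is not ITSM's algorithm: the moving-average preliminary trend, the seasonal averages and the straight-line trend fitted by least squares are assumptions chosen to keep the example short, and the function name is hypothetical. The series should span at least two full periods.

```python
def classical_decomposition(x, period):
    """Split x into a linear trend, a seasonal component and residuals (simplified sketch)."""
    n = len(x)
    q = period // 2
    # Step 1: preliminary trend from a moving average spanning one full period.
    prelim = [None] * n
    for t in range(q, n - q):
        if period % 2 == 0:
            window = 0.5 * x[t - q] + sum(x[t - q + 1:t + q]) + 0.5 * x[t + q]
        else:
            window = sum(x[t - q:t + q + 1])
        prelim[t] = window / period
    # Step 2: average the deviations from the preliminary trend, season by season.
    raw = []
    for k in range(period):
        devs = [x[t] - prelim[t] for t in range(k, n, period) if prelim[t] is not None]
        raw.append(sum(devs) / len(devs))
    centre = sum(raw) / period
    seasonal = [raw[t % period] - centre for t in range(n)]  # sums to zero over a period
    # Step 3: least-squares straight line through the deseasonalized data.
    d = [x[t] - seasonal[t] for t in range(n)]
    t_bar = (n - 1) / 2.0
    d_bar = sum(d) / n
    s_td = sum((t - t_bar) * (d[t] - d_bar) for t in range(n))
    s_tt = sum((t - t_bar) ** 2 for t in range(n))
    slope = s_td / s_tt
    trend = [d_bar + slope * (t - t_bar) for t in range(n)]
    residuals = [x[t] - trend[t] - seasonal[t] for t in range(n)]
    return trend, seasonal, residuals


if __name__ == "__main__":
    pattern = [1.0, -1.0, 2.0, -2.0]
    series = [2.0 + 0.5 * t + pattern[t % 4] for t in range(24)]
    trend, seasonal, residuals = classical_decomposition(series, 4)
    print(max(abs(r) for r in residuals))
```

For the noise-free series in the usage example the residuals are zero up to rounding, since the data consist exactly of a straight line plus a period-4 pattern that sums to zero.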

To  achieve  this,  select  Transf  orm>Classical  and  you  will  see  the  Classical 
Decomposition  dialog  box.  To  remove  a  seasonal  component  and  trend,  check  the 
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Figure  E-2 

The  logged  AIRPASS.TSM 
series  after  removal  of  trend 
and  seasonal  components 
by  classical  decomposition 


Example  E.2.7. 
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Seasonal  Fit  and  Polynomial  Fit  boxes,  enter  the  period  of  the  seasonal 
component,  and  choose  between  the  alternatives  Quadratic  Trend  and  Linear 
Trend.  Click  OK,  and  the  trend  and  seasonal  components  will  be  estimated  and 
removed  from  the  data,  leaving  the  estimated  noise  sequence  stored  as  the  current 
data  set. 

The  estimated  noise  sequence  automatically  replaces  the  previous  data  stored  in 
ITSM. 

The  logged  airline  passenger  data  have  an  apparent  seasonal  component  of  period 
12  (corresponding  to  the  month  of  the  year)  and  an  approximately  quadratic  trend. 
Remove  these  using  the  option  Transf  orm>Classical  as  described  above.  (An 
alternative  approach  is  to  use  the  option  Regr e  s  s  i on,  which  allows  the  specification 
and  fitting  of  polynomials  of  degree  up  to  10  and  a  linear  combination  of  up  to  4  sine 
waves.) 

Figure E-2 shows the transformed data (or residuals) $Y_t$, obtained by removal of
trend and seasonality from the logged AIRPASS.TSM series by classical decomposition.
$\{Y_t\}$ shows no obvious deviations from stationarity, and it would now be reasonable
to attempt to fit a stationary time series model to this series. To see how well the
estimated seasonal and trend components fit the data, select Transform>Show
Classical Fit. We shall not pursue this approach any further here, but turn
instead to the differencing approach. (You should have no difficulty in later returning
to this point and completing the classical decomposition analysis by fitting a stationary
time series model to $\{Y_t\}$.)

□ 
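For readers working outside ITSM, the seasonal part of a classical decomposition can be sketched in a few lines of Python. The function names are hypothetical, and the steps (a centred moving average of even period d, season-by-season averaging of the deviations from it, and centring so that the seasonal estimates sum to zero) follow the usual textbook description; that ITSM proceeds in exactly this way is an assumption. The sketch needs at least two full periods of data.

```python
def seasonal_component(x, d):
    """Estimate a period-d seasonal component (d even) of the series x."""
    n, q = len(x), d // 2
    # Centred moving average: half weight on the two end points, full weight between.
    trend = {}
    for t in range(q, n - q):
        total = 0.5 * x[t - q] + sum(x[t - q + 1:t + q]) + 0.5 * x[t + q]
        trend[t] = total / d
    # Average the deviations from the moving average separately for each season.
    w = []
    for k in range(d):
        devs = [x[t] - trend[t] for t in range(k, n, d) if t in trend]
        w.append(sum(devs) / len(devs))
    # Centre the estimates so that they sum to zero over one period.
    w_bar = sum(w) / d
    return [wk - w_bar for wk in w]


def deseasonalize(x, d):
    """Subtract the estimated seasonal component from every observation."""
    s = seasonal_component(x, d)
    return [x[t] - s[t % d] for t in range(len(x))]


# Toy data: linear trend plus a seasonal pattern that sums to zero over a period.
pattern = [3, -1, 4, -1, -5, 9, -2, 6, -5, -3, -5, 0]
series = [10 + 0.5 * t + pattern[t % 12] for t in range(48)]
estimated_pattern = seasonal_component(series, 12)
```

On this toy series the estimated pattern agrees with the true one up to rounding error, because the centred moving average reproduces a linear trend exactly. A polynomial trend would then be fitted to the deseasonalized values.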


Differencing 

Differencing is a technique that can also be used to remove seasonal components and
trends. The idea is simply to consider the differences between pairs of observations
with appropriate time separations. For example, to remove a seasonal component of
period 12 from the series $\{X_t\}$, we generate the transformed series

$$Y_t = X_t - X_{t-12}.$$

It is clear that all seasonal components of period 12 are eliminated by this transformation,
which is called differencing at lag 12. A linear trend can be eliminated
by differencing at lag 1, and a quadratic trend by differencing twice at lag 1 (i.e.,
differencing once to get a new series, then differencing the new series to get a second
new series). Higher-order polynomials can be eliminated analogously. It is worth
noting that differencing at lag 12 eliminates not only seasonal components with period
12 but also any linear trend, which it reduces to a constant.

Data are differenced in ITSM by selecting Transform>Difference
and entering the required lag in the resulting dialog box.

Example E.2.8. Restore the original airline passenger data using the option File>Import File
and selecting AIRPASS.TSM. We take natural logarithms as in Example E.2.6
by selecting Transform>Box-Cox and setting $\lambda = 0$. The transformed
series can now be deseasonalized by differencing at lag 12. To do this select
Transform>Difference, enter the lag 12 in the dialog box, and click OK.
Inspection of the graph of the deseasonalized series suggests a further differencing at
lag 1 to eliminate the remaining trend. To do this, repeat the previous step with lag
equal to 1 and you will see the transformed and twice-differenced series shown in
Figure E-3.

Figure E-3. The series AIRPASS.TSM after taking logs and differencing at lags 12 and 1.

□ 
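The effect of differencing is easy to verify numerically. The following minimal Python sketch (the function name and the toy series are illustrative assumptions, not part of ITSM) differences a series with a linear trend and a period-12 seasonal pattern, first at lag 12 and then at lag 1.

```python
def difference(x, lag):
    """Return y[t] = x[t] - x[t - lag]; the first `lag` observations are lost."""
    return [x[t] - x[t - lag] for t in range(lag, len(x))]


# Toy series: linear trend with slope 0.25 plus a seasonal pattern of period 12.
season = [3, -1, 4, -1, -5, 9, -2, 6, -5, -3, -5, 0]
x = [2.0 + 0.25 * t + season[t % 12] for t in range(36)]

y12 = difference(x, 12)  # seasonality removed; the trend becomes the constant 12 * 0.25
y = difference(y12, 1)   # the remaining constant is removed as well
```

Each differencing step shortens the series by the lag used, so differencing 144 monthly observations at lags 12 and 1 leaves 131 values.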


Subtracting  the  Mean 

The  term  ARMA  model  is  used  in  ITSM  to  denote  a  zero-mean  ARMA  process  (see 
Definition  3.1.1).  To  fit  such  a  model  to  data,  the  sample  mean  of  the  data  should 
therefore  be  small.  Once  the  apparent  deviations  from  stationarity  of  the  data  have  been 
removed,  we  therefore  (in  most  cases)  subtract  the  sample  mean  of  the  transformed 
data  from  each  observation  to  generate  a  series  to  which  we  then  fit  a  zero-mean 
stationary  model.  Effectively  we  are  estimating  the  mean  of  the  model  by  the  sample 
mean,  then  fitting  a  (zero-mean)  ARMA  model  to  the  “mean-corrected”  transformed 
data.  If  we  know  a  priori  that  the  observations  are  from  a  process  with  zero  mean,  then 
this  process  of  mean  correction  is  omitted.  ITSM  keeps  track  of  all  the  transformations 
(including mean correction) that are made. When it comes time to predict the original
series,  ITSM  will  invert  all  these  transformations  automatically. 

Example E.2.9. Subtract the mean of the transformed and twice-differenced series AIRPASS.TSM by
selecting Transform>Subtract Mean. To check the current model status press
the red INFO button, and you will see that the current model is white noise with
variance 1, since no model has yet been entered.

□ 
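ITSM inverts the stored transformations automatically, but the arithmetic involved is simple enough to sketch by hand. The Python functions below are hypothetical helpers, not ITSM routines: `forward` applies the sequence used for the airline data (logarithms, differencing at lags 12 and 1, mean correction), and `invert` maps values of the transformed series back to the original scale, assuming at least 13 earlier observations of the original series are available.

```python
import math


def forward(x):
    """Log, difference at lag 12, difference at lag 1, then subtract the mean."""
    logx = [math.log(v) for v in x]
    d12 = [logx[t] - logx[t - 12] for t in range(12, len(logx))]
    d1 = [d12[t] - d12[t - 1] for t in range(1, len(d12))]
    mean = sum(d1) / len(d1)
    return [v - mean for v in d1], mean


def invert(y_future, mean, x_history):
    """Undo the transformations for the time points that follow x_history."""
    logx = [math.log(v) for v in x_history]
    start = len(logx)
    for y in y_future:
        # (1 - B)(1 - B^12) log X_t = y + mean, solved for log X_t.
        logx.append(y + mean + logx[-1] + logx[-12] - logx[-13])
    return [math.exp(v) for v in logx[start:]]
```

Feeding the last few transformed values of a series back through `invert`, together with the earlier original observations, reproduces the original values up to rounding error. The same recursion applies to predicted values of the transformed series.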


E.3  Finding  a  Model  for  Your  Data 

After  transforming  the  data  (if  necessary)  as  described  above,  we  are  now  in  a  position 
to  fit  an  ARMA  model.  ITSM  uses  a  variety  of  tools  to  guide  us  in  the  search 
for  an  appropriate  model.  These  include  the  sample  ACF  (autocorrelation  function), 
the sample PACF (partial autocorrelation function), and the AICC statistic, a
bias-corrected form of Akaike’s AIC statistic (see Section 5.5.2).


E.3.1  Autofit 

Before  discussing  the  considerations  that  go  into  the  selection,  fitting,  and  checking  of 
a  stationary  time  series  model,  we  first  briefly  describe  an  automatic  feature  of  ITSM 
that searches through ARMA(p, q) models with p and q between specified limits (less
than or equal to 27) and returns the model with smallest AICC value (see Sections 5.5.2
and E.3.5). Once the data set is judged to be representable by a stationary model,
select Model>Estimation>Autofit. A dialog box will appear in which you
must specify the upper and lower limits for p and q. Since the number of maximum
likelihood models to be fitted is the product of the number of p-values and the number
of q-values, these ranges should not be chosen to be larger than necessary. Once the
limits  have  been  specified,  press  Start,  and  the  search  will  begin.  You  can  watch  the 
progress  of  the  search  in  the  dialog  box  that  continually  updates  the  values  of  p  and 
q  and  the  best  model  found  so  far.  This  option  does  not  consider  models  in  which  the 
coefficients  are  required  to  satisfy  constraints  (other  than  causality)  and  consequently 
does  not  always  lead  to  the  optimal  representation  of  the  data.  However,  like  the  tools 
described  below,  it  provides  valuable  information  on  which  to  base  the  selection  of  an 
appropriate  model. 


E.3.2 The Sample ACF and PACF

Pressing  the  second  yellow  button  at  the  top  of  the  ITSM  window  will  produce  graphs 
of  the  sample  ACF  and  PACF  for  values  of  the  lag  h  from  1  up  to  40.  For  higher 
lags  choose  Statistics>ACF/PACF>Specify  Lag,  enter  the  maximum  lag 
required,  and  click  OK.  Pressing  the  second  yellow  button  repeatedly  then  rotates 
the  display  through  ACF,  PACF,  and  side-by-side  graphs  of  both.  Values  of  the  ACF 
that  decay  rapidly  as  h  increases  indicate  short-term  dependency  in  the  time  series, 
while  slowly  decaying  values  indicate  long-term  dependency.  For  ARMA  fitting  it 
is  desirable  to  have  a  sample  ACF  that  decays  fairly  rapidly.  A  sample  ACF  that  is 
positive  and  very  slowly  decaying  suggests  that  the  data  may  have  a  trend.  A  sample 
ACF with very slowly damped periodicity suggests the presence of a periodic seasonal
component. In either of these two cases you may need to transform your data before
continuing.

As a rule of thumb, the sample ACF and PACF are good estimates of the ACF and
PACF of a stationary process for lags up to about a third of the sample size. It is clear
from the definition of the sample ACF, $\hat\rho(h)$, that it will be a very poor estimator of
$\rho(h)$ for h close to the sample size n.

The horizontal lines on the graphs of the sample ACF and PACF are the bounds
$\pm 1.96/\sqrt{n}$. If the data constitute a large sample from an independent white noise
sequence, approximately 95% of the sample autocorrelations should lie between
these bounds. Large or frequent excursions from the bounds suggest that we need a
model to explain the dependence, and they sometimes also indicate the kind of model we need
(see below). To obtain numerical values of the sample ACF and PACF, right-click on
the graphs and select Info.

The graphs of the sample ACF and PACF sometimes suggest an appropriate
ARMA model for the data. As a rough guide, if the sample ACF falls between the
plotted bounds $\pm 1.96/\sqrt{n}$ for lags h > q, then an MA(q) model is suggested, while if
the sample PACF falls between the plotted bounds $\pm 1.96/\sqrt{n}$ for lags h > p, then an
AR(p) model is suggested.

If neither the sample ACF nor PACF “cuts off” as in the previous paragraph, a
more refined model selection technique is required (see the discussion of the AICC
statistic in Section 5.5.2). Even if the sample ACF or PACF does cut off at some lag,
it is still advisable to explore models other than those suggested by the sample ACF
and PACF values.

Example E.3.1. Figure E-4 shows the sample ACF of the AIRPASS.TSM series after taking logarithms,
differencing at lags 12 and 1, and subtracting the mean. Figure E-5 shows the
corresponding sample PACF. These graphs suggest that we consider an MA model
of order 12 (or perhaps 23) with a large number of zero coefficients, or alternatively
an AR model of order 12.

Figure E-4. The sample ACF of the transformed AIRPASS.TSM series.

□ 
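The quantities plotted by ITSM can be reproduced with a short Python sketch. The function names are hypothetical; the sample ACF uses the divisor n, and the PACF is obtained from the Durbin-Levinson recursion applied to the autocorrelations, which is one standard way of computing it (that ITSM uses this route is an assumption).

```python
import math


def sample_acf(x, max_lag):
    """Sample autocorrelations rho_hat(0), ..., rho_hat(max_lag), divisor n."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    gamma0 = sum(d * d for d in dev) / n
    return [sum(dev[t] * dev[t + h] for t in range(n - h)) / n / gamma0
            for h in range(max_lag + 1)]


def pacf_from_acf(rho):
    """Partial autocorrelations from rho[0..m] by the Durbin-Levinson recursion."""
    pacf, phi = [1.0], []
    for k in range(1, len(rho)):
        num = rho[k] - sum(phi[j] * rho[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(phi[j] * rho[j + 1] for j in range(k - 1))
        phi_kk = num / den
        phi = [phi[j] - phi_kk * phi[k - 2 - j] for j in range(k - 1)] + [phi_kk]
        pacf.append(phi_kk)
    return pacf


def lags_outside_bounds(values, n):
    """Lags h >= 1 at which |value| exceeds the bound 1.96 / sqrt(n)."""
    bound = 1.96 / math.sqrt(n)
    return [h for h in range(1, len(values)) if abs(values[h]) > bound]
```

Calling `lags_outside_bounds` on the sample ACF and on the sample PACF lists the lags that would appear outside the plotted bounds. A last such lag of q in the ACF points towards an MA(q) model, and a last such lag of p in the PACF towards an AR(p) model.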

Figure E-5. The sample PACF of the transformed AIRPASS.TSM series.


E.3.3 Entering a Model

A  major  function  of  ITSM  is  to  find  an  ARMA  model  whose  properties  reflect  to 
a  high  degree  those  of  an  observed  (and  possibly  transformed)  time  series.  Any 
particular causal ARMA(p, q) model with p ≤ 27 and q ≤ 27 can be entered
directly by choosing Model>Specify, entering the values of p, q, the coefficients,
and the white noise variance, and clicking OK. If there is a data set already
open in ITSM, a quick way of entering a reasonably appropriate model is to use
the option Model>Estimation>Preliminary, which estimates the coefficients
and white noise variance of an ARMA model after you have specified the orders p and
q and selected one of the four preliminary estimation algorithms available. An optimal
preliminary AR model can also be fitted by checking Find AR model with min
AICC in the Preliminary Estimation dialog box. If no model is entered or
estimated, ITSM assumes the default ARMA(0,0), or white noise, model

$$X_t = Z_t,$$

where $\{Z_t\}$ is an uncorrelated sequence of random variables with mean zero and
variance 1.

If  you  have  data  and  no  particular  ARMA  model  in  mind,  it  is  advisable  to  use 
the  option  Model>Estimation>Preliminary  or  equivalently  to  press  the  blue 
PRE  button  at  the  top  of  the  ITSM  window. 

Sometimes  you  may  wish  to  try  a  model  found  in  a  previous  session  or  a 
model  suggested  by  someone  else.  In  that  case  choose  Model>Specify  and  enter 
the required model. You can save both the model and data from any project by selecting
File>Project>Save as and specifying the name for the new file. When the new
file  is  opened,  both  the  model  and  the  data  will  be  imported.  To  create  a  project  with 
this  model  and  a  new  data  set  select  File>Import  File  and  enter  the  name  of 
the  file  containing  the  new  data.  (This  file  must  contain  data  only.  If  it  also  contains 
a  model,  then  the  model  will  be  imported  with  the  data  and  the  model  previously  in 
ITSM  will  be  overwritten.) 


E.3.4 Preliminary Estimation

The  option  Model>Estimation>Preliminary  contains  fast  (but  not  the  most 
efficient)  model-fitting  algorithms.  They  are  useful  for  suggesting  the  most  promising 
models  for  the  data,  but  should  be  followed  by  maximum  likelihood  estimation 
using  Model>Estimation>Max  likelihood.  The  fitted  preliminary  model  is 
generally used as an initial approximation with which to start the nonlinear optimization
carried out in the course of maximizing the (Gaussian) likelihood.

To fit an ARMA model of specified order, first enter the values of p and q (see
Section 2.6.1). For pure AR models q = 0, and the preliminary estimation option offers a
choice between the Burg and Yule-Walker estimates. (The Burg estimates frequently
give higher values of the Gaussian likelihood than the Yule-Walker estimates.) If q =
0, you can also check the box Find AR model with min AICC to allow the
program to fit AR models of orders 0, 1, ..., 27 and select the one with smallest AICC
value (Section 5.5.2). For models with q > 0, ITSM provides a choice between two
preliminary  estimation  methods,  one  based  on  the  Hannan-Rissanen  procedure  and 
the  other  on  the  innovations  algorithm.  If  you  choose  the  innovations  option,  a  default 
value  of  m  will  be  displayed  on  the  screen.  This  parameter  was  defined  in  Section  5.1.3. 
The  standard  choice  is  the  default  value  computed  by  ITSM.  The  Hannan-Rissanen 
algorithm  is  recommended  when  p  and  q  are  both  greater  than  0,  since  it  tends 
to  give  causal  models  more  frequently  than  the  innovations  method.  The  latter  is 
recommended  when  p  =  0. 

Once  the  required  entries  in  the  Preliminary  Estimation  dialog  box  have  been 
completed,  click  OK,  and  ITSM  will  quickly  estimate  the  parameters  of  the  selected 
model  and  display  a  number  of  diagnostic  statistics.  (If  p  and  q  are  both  greater  than 
0,  it  is  possible  that  the  fitted  model  may  be  noncausal,  in  which  case  ITSM  sets 
all  the  coefficients  to  .001  to  ensure  the  causality  required  for  subsequent  maximum 
likelihood  estimation.  It  will  also  give  you  the  option  of  fitting  a  model  of  different 
order.) 

Provided  that  the  fitted  model  is  causal,  the  estimated  parameters  are  given  with 
the ratio of each estimate to 1.96 times its standard error. The denominator (1.96 ×
standard error) is the critical value (at level .05) for the coefficient. Thus, if the ratio is
greater than 1 in absolute value, we may conclude (at level .05) that the corresponding
coefficient is different from zero. On the other hand, a ratio less than 1 in absolute
value suggests the possibility that the corresponding coefficient in the model may be
zero. (If the innovations option is chosen, the ratios of estimates to 1.96 × standard
error are displayed only when p = q or p = 0.) In the Preliminary Estimates window
you will also see one or more estimates of the white noise variance (the residual
sum of squares divided by the sample size is the estimate retained by ITSM) and
some further diagnostic statistics. These are $-2\ln L(\hat\phi, \hat\theta, \hat\sigma^2)$, where $L$ denotes the
Gaussian likelihood (5.2.9), and the AICC statistic

$$-2\ln L + 2(p + q + 1)n/(n - p - q - 2)$$

(see  Section  5.5.2). 

Our eventual aim is to find a model with as small an AICC value as possible. Smallness
of the AICC value computed in the preliminary estimation phase is indicative of a
good model, but should be used only as a rough guide. Final decisions between models
should be based on maximum likelihood estimation, carried out using the option
Model>Estimation>Max likelihood, since for fixed p and q, the values of
$\phi$, $\theta$, and $\sigma^2$ that minimize the AICC statistic are the maximum likelihood estimates,
not the preliminary estimates. After completing preliminary estimation, ITSM stores
the estimated model coefficients and white noise variance. The stored estimate of the
white noise variance is the sum of squares of the residuals (or one-step prediction
errors) divided by the number of observations.

A variety of models should be explored using the preliminary estimation algorithms,
with a view to finding the most likely candidates for minimizing AICC when
the parameters are reestimated by maximum likelihood.

Example E.3.2. To find the minimum-AICC Burg AR model for the logged, differenced, and
mean-corrected series AIRPASS.TSM currently stored in ITSM, press the blue PRE button,
set the MA order equal to zero, select Burg and Find AR model with min
AICC, and then click OK. The minimum-AICC AR model is of order 12 with an AICC
value of -458.13. To fit a preliminary MA(25) model to the same data, press the blue
PRE button again, but this time set the AR order to 0, the MA order to 25, select
Innovations, and click OK.

The ratios (estimated coefficient)/(1.96 × standard error) indicate that the coefficients
at lags 1 and 12 are nonzero, as suggested by the sample ACF. The estimated
coefficients at lags 3 and 23 also look substantial even though the corresponding ratios
are less than 1 in absolute value. The displayed values are as follows:

    -0.6944   0.2076  -0.1065  -0.3532  -0.2147
    -0.0960  -0.0402   0.9475  -0.3563  -0.3659

The estimated white noise variance is 0.00115 and the AICC value is -440.93, which
is not as good as that of the AR(12) model. Later we shall find a subset MA(25) model
that has a smaller AICC value than both of these models.

□ 
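The idea behind Find AR model with min AICC can be illustrated with the Yule-Walker alternative to Burg's algorithm. The Python sketch below is an approximation under stated assumptions, not ITSM's procedure: the function name is hypothetical, the Yule-Walker coefficients are obtained by the Durbin-Levinson recursion, and $-2\ln L$ is replaced by the large-sample approximation $n\ln(2\pi v_p) + n$, where $v_p$ is the estimated white noise variance of the AR(p) fit.

```python
import math
import random


def select_ar_order(x, max_order):
    """Return (aicc, p, phi, v) for the Yule-Walker AR fit with smallest approximate AICC."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    gamma = [sum(dev[t] * dev[t + h] for t in range(n - h)) / n
             for h in range(max_order + 1)]
    phi, v = [], gamma[0]
    fits = []
    for p in range(max_order + 1):
        if p > 0:
            # Durbin-Levinson step from order p - 1 to order p.
            phi_pp = (gamma[p] - sum(phi[j] * gamma[p - 1 - j] for j in range(p - 1))) / v
            phi = [phi[j] - phi_pp * phi[p - 2 - j] for j in range(p - 1)] + [phi_pp]
            v *= 1.0 - phi_pp ** 2
        neg2_log_lik = n * math.log(2.0 * math.pi * v) + n  # approximation only
        aicc = neg2_log_lik + 2.0 * (p + 1) * n / (n - p - 2)
        fits.append((aicc, p, list(phi), v))
    return min(fits)


# Illustration on simulated AR(1) data with coefficient 0.8 and unit noise variance.
random.seed(0)
data, prev = [], 0.0
for _ in range(1000):
    prev = 0.8 * prev + random.gauss(0.0, 1.0)
    data.append(prev)
best_aicc, best_p, best_phi, best_v = select_ar_order(data, 8)
```

Because the likelihood is only approximated, the AICC values produced here will not agree exactly with those reported by ITSM. The ranking of candidate orders is what the sketch is meant to convey.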


E.3.5  The  AICC  Statistic 

The AICC statistic for the model with parameters p, q, $\phi$, and $\theta$ is defined (see
Section 5.5.2) as

$$\mathrm{AICC}(\phi, \theta) = -2\ln L(\phi, \theta, S(\phi, \theta)/n) + 2(p + q + 1)n/(n - p - q - 2),$$

and  a  model  chosen  according  to  the  AICC  criterion  minimizes  this  statistic. 
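As a worked check of the definition, the AICC can be recomputed from a reported value of $-2\ln L$. The Python sketch below uses the figures displayed by ITSM for the subset MA model fitted to the airline data in Example E.3.4, with n = 131 (the 144 monthly observations less the 13 lost to differencing at lags 12 and 1). Taking p + q to be 4, the number of nonzero coefficients actually estimated, rather than 23, is an assumption; it is the choice that reproduces the reported AICC.

```python
def aicc(neg2_log_lik, k, n):
    """AICC = -2 ln L + 2 (k + 1) n / (n - k - 2), where k plays the role of p + q."""
    return neg2_log_lik + 2.0 * (k + 1) * n / (n - k - 2)


# -2 ln L = -496.517 as reported; four estimated MA coefficients; n = 131.
value = aicc(-496.517, k=4, n=131)
print(round(value, 3))  # -486.037
```

The penalty term here is 2(5)(131)/125 = 10.48, so the AICC exceeds $-2\ln L$ by that amount.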

Model-selection statistics other than AICC are also available in ITSM. A Bayesian
modification  of  the  AIC  statistic  known  as  the  BIC  statistic  is  also  computed  in  the 
option  Model>Estimation>Max  likelihood.  It  is  used  in  the  same  way  as 
the  AICC. 

An  exhaustive  search  for  a  model  with  minimum  AICC  or  BIC  value  can  be 
very  slow.  For  this  reason  the  sample  ACF  and  PACF  and  the  preliminary  estimation 
techniques described above are useful in narrowing down the range of models to
be  considered  more  carefully  in  the  maximum  likelihood  estimation  stage  of  model 
fitting. 


E.3.6 Changing Your Model

The model currently stored by the program can be checked at any time by selecting
Model>Specify. Any parameter can be changed in the resulting dialog box,
including the white noise variance. The model can be filed together with the data for
later use by selecting File>Project>Save as and specifying a file name with
suffix .TSM.

Example E.3.3. We shall now set some of the coefficients in the current model to zero. To do this choose
Model>Specify and click on the box containing the value -0.35676 of Theta(1).
Press Enter, and the value of Theta(2) will appear in the box. Set this to zero. Press
Enter again, and the value of Theta(3) will appear. Continue to work through the
coefficients, setting all except Theta(1), Theta(3), Theta(12), and Theta(23) equal to
zero. When you have reset the parameters, click OK, and the new model stored in ITSM
will be the subset MA(23) model

$$X_t = Z_t - 0.357Z_{t-1} - 0.163Z_{t-3} - 0.499Z_{t-12} + 0.201Z_{t-23},$$

where $\{Z_t\} \sim \mathrm{WN}(0, 0.00115)$.

□ 


E.3.7 Maximum Likelihood Estimation

Once you have specified values of p and q and possibly set some coefficients to zero,
you can carry out efficient parameter estimation by selecting
Model>Estimation>Max likelihood or equivalently by pressing the blue MLE button.

The resulting dialog box displays the default settings, which in most cases will not
need to be modified. However, if you wish to compute the likelihood without maximizing
it, check the box labeled No optimization. The remaining information
concerns  the  optimization  settings.  (With  the  default  settings,  any  coefficients  that  are 
set  to  zero  will  be  treated  as  fixed  values  and  not  as  parameters.  Coefficients  to  be 
optimized  must  therefore  not  be  set  exactly  to  zero.  If  you  wish  to  impose  further 
constraints  on  the  optimization,  press  the  Constrain  optimization  button. 
This  allows  you  to  fix  certain  coefficients  or  to  impose  multiplicative  relationships 
on  the  coefficients  during  optimization.) 

To  find  the  maximum  likelihood  estimates  of  your  parameters,  click  OK,  and  the 
estimated  parameters  will  be  displayed.  To  refine  the  estimates,  repeat  the  estimation, 
specifying  a  smaller  value  of  the  accuracy  parameter  in  the  Maximum  Likelihood 
dialog  box. 

Example E.3.4. To find the maximum likelihood estimates of the parameters in the model for the
logged, differenced, and mean-corrected airline passenger data currently stored in
ITSM, press the blue MLE button and click OK. The following estimated parameters
and diagnostic statistics will then be displayed:

    ARMA MODEL:
    X(t) = Z(t) + (-.355) * Z(t-1) + (-.201) * Z(t-3) + (-.523) * Z(t-12)
           + (.242) * Z(t-23)
    WN Variance = .001250
    MA Coefficients
    THETA( 1)= -.355078  THETA( 3)= -.201125
    THETA(12)= -.523423  THETA(23)=  .241527
    Standard Error of MA Coefficients
    THETA( 1):  .059385  THETA( 3):  .059297
    THETA(12):  .058011  THETA(23):  .055828
    (Residual SS)/N = .125024E-02
    AICC = -.486037E+03
    BIC = -.487622E+03
    -2 Ln(Likelihood)= -.496517E+03
    Accuracy parameter = .00205000
    Number of iterations = 5
    Number of function evaluations = 46
    Optimization stopped within accuracy level.

The last message indicates that the minimum of $-2\ln L$ has been located with the
specified accuracy. If you see the message

    Iteration limit exceeded,

then the minimum of $-2\ln L$ could not be located with the number of iterations (50)
allowed. You can continue the search (starting from the point at which the iterations
were interrupted) by pressing the MLE button again, possibly after increasing the
maximum number of iterations from 50 to 100.

□ 
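The ratio diagnostic described in Section E.3.4 can also be formed by hand from the standard errors in this display. The short Python sketch below does so for the four estimated MA coefficients; the dictionary layout and function name are illustrative assumptions, and the numbers are copied from the ITSM output of Example E.3.4.

```python
# Estimates and standard errors keyed by lag, as displayed by ITSM.
estimates = {1: -0.355078, 3: -0.201125, 12: -0.523423, 23: 0.241527}
std_errors = {1: 0.059385, 3: 0.059297, 12: 0.058011, 23: 0.055828}


def ratios(estimates, std_errors):
    """Ratio of each estimate to 1.96 times its standard error."""
    return {lag: estimates[lag] / (1.96 * std_errors[lag]) for lag in estimates}


for lag, r in sorted(ratios(estimates, std_errors).items()):
    verdict = "nonzero at level .05" if abs(r) > 1 else "possibly zero"
    print(f"THETA({lag:2d}): ratio = {r:7.3f}  ({verdict})")
```

All four ratios exceed 1 in absolute value, so each of the retained coefficients is judged different from zero at level .05.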


E.3.8  Optimization  Results 

After maximizing the Gaussian likelihood, ITSM displays the model parameters (coefficients
and white noise variance), the values of $-2\ln L$, AICC, BIC, and information
regarding the computations.

Example E.3.5. The next stage of the analysis is to consider a variety of competing models and to select
the most suitable. The following table shows the AICC statistics for a variety of subset
moving average models of order less than 24.

    Lags                    AICC
    1   3   12  23          -486.04
    1   3   12  13  23      -485.78
    1   3   5   12  23      -489.95
    1   3   12  13          -482.62
    1   12                  -475.91

The  best  of  these  models  from  the  point  of  view  of  AICC  value  is  the  one  with 
nonzero  coefficients  at  lags  1,  3,  5,  12,  and  23.  To  obtain  this  model  from  the  one 
currently  stored  in  ITSM,  select  Model>Specify,  change  the  value  of  THETA(5) 
from  zero  to  .001,  and  click  OK.  Then  reoptimize  by  pressing  the  blue  MLE  button  and 
clicking OK. You should obtain the noninvertible model

$$X_t = Z_t - 0.434Z_{t-1} - 0.305Z_{t-3} + 0.238Z_{t-5} - 0.656Z_{t-12} + 0.351Z_{t-23},$$

where $\{Z_t\} \sim \mathrm{WN}(0, 0.00103)$. For future reference, file the model and data as
AIRPASS2.TSM using the option File>Project>Save as.

□ 

The  next  step  is  to  check  our  model  for  goodness  of  fit. 


E.4  Testing  Your  Model 

Once  we  have  a  model,  it  is  important  to  check  whether  it  is  any  good  or  not.  Typically 
this  is  judged  by  comparing  observations  with  corresponding  predicted  values  obtained 
from  the  fitted  model.  If  the  fitted  model  is  appropriate  then  the  prediction  errors  should 
behave in a manner that is consistent with the model. The residuals are the rescaled
one-step prediction errors,

$$\hat W_t = (X_t - \hat X_t)/\sqrt{r_{t-1}},$$

where $\hat X_t$ is the best linear mean-square predictor of $X_t$ based on the observations up
to time $t - 1$, $r_{t-1} = E(X_t - \hat X_t)^2/\sigma^2$, and $\sigma^2$ is the white noise variance of the fitted
model.

If the data were truly generated by the fitted ARMA(p, q) model with white noise
sequence $\{Z_t\}$, then for large samples the properties of $\{\hat W_t\}$ should reflect those of $\{Z_t\}$.
To check the appropriateness of the model we therefore examine the residual series
$\{\hat W_t\}$, and check that it resembles a realization of a white noise sequence.

ITSM provides a number of tests for doing this in the Residuals Menu, which
is obtained by selecting the option Statistics>Residual Analysis. Within
this option are the suboptions

Plot 

QQ-Plot  (normal) 

QQ-Plot  (t-distr) 

Histogram 
ACF/PACF 
ACF  Abs  vals/Squares 
Tests  of  randomness 


E.4.1  Plotting  the  Residuals 

Select Statistics>Residual Analysis>Histogram, and you will see
a histogram of the rescaled residuals, defined as

$$\hat R_t = \hat W_t/\hat\sigma,$$

where $n\hat\sigma^2$ is the sum of the squared residuals. If the fitted model is appropriate, the
histogram  of  the  rescaled  residuals  should  have  mean  close  to  zero.  If  in  addition  the 
data  are  Gaussian,  this  will  be  reflected  in  the  shape  of  the  histogram,  which  should 
then  resemble  a  normal  density  with  mean  zero  and  variance  1. 

Select Statistics>Residual Analysis>Plot and you will see a graph
of $\hat R_t$ vs. t. If the fitted model is appropriate, this should resemble a realization of
a white noise sequence. Look for trends, cycles, and nonconstant variance, any of
which suggest that the fitted model is inappropriate. If substantially more than 5% of
the rescaled residuals lie outside the bounds ±1.96 or if there are rescaled residuals
far outside these bounds, then the fitted model should not be regarded as Gaussian.

Compatibility of the distribution of the residuals with either the normal distribution
or the t-distribution can be checked by inspecting the corresponding qq plots and
checking for approximate linearity. To test for normality, the Jarque-Bera statistic is
also computed.

Example E.4.1. The histogram of the rescaled residuals from our model for the logged, differenced,
and mean-corrected airline passenger series is shown in Figure E-6. The mean is close
to zero, and the shape suggests that the assumption of Gaussian white noise is not
unreasonable in our proposed model.

The graph of $\hat R_t$ vs. t is shown in Figure E-7. A few of the rescaled residuals
are greater in magnitude than 1.96 (as is to be expected), but there are no obvious
indications here that the model is inappropriate. The approximate linearity of the
normal qq plot and the Jarque-Bera test confirm the approximate normality of the
residuals.

Figure E-6. Histogram of the rescaled residuals from AIRPASS.MOD.

Figure E-7. Time plot of the rescaled residuals from AIRPASS.MOD.

□


E.4.2 ACF/PACF of the Residuals

If we were to assume that our fitted model is the true process generating the data, then
the observed residuals would be realized values of a white noise sequence.

In particular, the sample ACF and PACF of the observed residuals should lie within
the bounds $\pm 1.96/\sqrt{n}$ roughly 95% of the time. These bounds are displayed on the
graphs of the ACF and PACF. If substantially more than 5% of the correlations are
outside these limits, or if there are a few very large values, then we should look for
a better-fitting model. (More precise bounds, due to Box and Pierce, can be found in
Brockwell and Davis (1991) Section 10.4.)

Example E.4.2. Choose Statistics>Residual Analysis>ACF/PACF, or equivalently press
the middle green button at the top of the ITSM window. The sample ACF and PACF
of the residuals will then appear as shown in Figures E-8 and E-9. No correlations
are outside the bounds in this case. They appear to be compatible with the hypothesis
that the residuals are in fact observations of a white noise sequence. To check for
independence of the residuals, the sample autocorrelation functions of their absolute
values and squares can be plotted by clicking on the third green button.

Figure E-8. Sample ACF of the residuals from AIRPASS.MOD.

□ 
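The informal checks of Sections E.4.1 and E.4.2 amount to rescaling the residuals and counting how many values fall outside a bound. A minimal Python sketch follows; the function names are hypothetical, and the residuals themselves are assumed to have been obtained elsewhere (for example, exported from ITSM).

```python
import math


def rescale(residuals):
    """Divide the residuals by sigma_hat, where n * sigma_hat**2 is their sum of squares."""
    n = len(residuals)
    sigma_hat = math.sqrt(sum(w * w for w in residuals) / n)
    return [w / sigma_hat for w in residuals]


def share_outside(values, bound):
    """Fraction of the values whose magnitude exceeds the bound."""
    return sum(1 for v in values if abs(v) > bound) / len(values)


# Rescaled residuals are compared with 1.96; residual autocorrelations at
# lags h >= 1 would instead be compared with 1.96 / sqrt(n).
rescaled = rescale([0.02, -0.05, 0.01, 0.04, -0.03, 0.09, -0.01, 0.00])
print(share_outside(rescaled, 1.96))
```

A share well above 0.05, or a few values far outside the bound, is the warning sign described in the text. By construction the rescaled residuals have mean square equal to 1.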


E.4.3  Testing  for  Randomness  of  the  Residuals 

The  option  Statistics>Residual  Analysis>Tests  of  Randomness 
carries  out  the  six  tests  for  randomness  of  the  residuals  described  in  Section  5.3.3. 
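Three of these tests (turning points, difference-sign, and rank) compare a count with a normal approximation under the hypothesis of iid noise. The Python sketch below evaluates the null mean and standard deviation for each, together with a two-sided p-value. The formulas are the standard ones for these tests, and their use here to mirror ITSM is an assumption, although with n = 131 they agree with the ANORMAL entries that ITSM reports for the airline residuals.

```python
import math


def turning_point_null(n):
    """Null mean and standard deviation of the number of turning points."""
    return 2.0 * (n - 2) / 3.0, math.sqrt((16.0 * n - 29.0) / 90.0)


def difference_sign_null(n):
    """Null mean and standard deviation of the number of positive differences."""
    return (n - 1) / 2.0, math.sqrt((n + 1) / 12.0)


def rank_null(n):
    """Null mean and standard deviation of the rank statistic."""
    return n * (n - 1) / 4.0, math.sqrt(n * (n - 1) * (2.0 * n + 5.0) / 72.0)


def two_sided_p(stat, mean, sd):
    """Two-sided p-value from the normal approximation."""
    z = abs(stat - mean) / sd
    return math.erfc(z / math.sqrt(2.0))


n = 131  # length of the logged, twice-differenced airline series
for name, stat, (mean, sd) in [
    ("TURNING POINTS", 87, turning_point_null(n)),
    ("DIFFERENCE-SIGN", 65, difference_sign_null(n)),
    ("RANK TEST", 3934, rank_null(n)),
]:
    print(f"{name:16s} mean = {mean:8.2f}  sd = {sd:7.2f}  p = {two_sided_p(stat, mean, sd):.3f}")
```

The observed counts 87, 65, and 3934 are those reported by ITSM for the airline residuals. None of the resulting p-values is below 0.05, so none of the three tests rejects the hypothesis of iid noise.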

Example E.4.3. The residuals from our model for the logged, differenced, and mean-corrected series
AIRPASS.TSM are checked by selecting the option indicated above and selecting the
parameter h for the portmanteau tests. Adopting the value h = 25 suggested by ITSM,
we obtain the following results:

    RANDOMNESS TEST STATISTICS (see Section 5.3.3)

    LJUNG-BOX PORTM. = 13.76    CHISQUR( 20),                  p-value = 0.843
    MCLEOD-LI PORTM. = 17.39    CHISQUR( 25),                  p-value = 0.867
    TURNING POINTS   = 87.      ANORMAL( 86.00, 4.79**2),      p-value = 0.835
    DIFFERENCE-SIGN  = 65.      ANORMAL( 65.00, 3.32**2),      p-value = 1.000
    RANK TEST        = 3934.    ANORMAL(4257.50, 251.3**2),    p-value = 0.198
    JARQUE-BERA      = 4.33     CHISQUR(2)                     p-value = 0.115
    ORDER OF MIN AICC YW MODEL FOR RESIDUALS = 0

Every test is easily passed by our fitted model (with significance level $\alpha = 0.05$), and
the order of the minimum-AICC AR model for the residuals supports the compatibility
of the residuals with white noise. For later use, file the residuals by pressing the red
EXP button and exporting the residuals to a file with the name AIRRES.TSM.

Figure E-9. Sample PACF of the residuals from AIRPASS.MOD.

□


E.5 Prediction

One of the main purposes of time series modeling is the prediction of future observations.
Once you have found a suitable model for your data, you can predict future
values using the option Forecasting>ARMA. (The other options listed under
Forecasting refer to the methods of Chapter 10.)


E.5.1  Forecast  Criteria 

Given observations $X_1, \ldots, X_n$ of a series that we assume to be appropriately modeled
as an ARMA(p, q) process, ITSM predicts future values of the series $X_{n+h}$ from the
data and the model by computing the linear combination $P_n(X_{n+h})$ of $X_1, \ldots, X_n$ that
minimizes the mean squared error $E(X_{n+h} - P_n(X_{n+h}))^2$.
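
For a zero-mean stationary series, the minimizing coefficients solve a linear system built from the autocovariance function. The Python sketch below illustrates this criterion under that assumption; the function name and the AR(1) parameter values are hypothetical, and it makes no claim about how ITSM computes its predictors internally.

```python
import numpy as np

def best_linear_predictor(acvf, n, h):
    """Coefficients a (applied to X_n, X_{n-1}, ..., X_1) of P_n(X_{n+h})
    and the corresponding mean squared error, for a zero-mean stationary
    series with autocovariance function acvf(k)."""
    # Covariance matrix of (X_n, X_{n-1}, ..., X_1).
    Gamma = np.array([[acvf(abs(i - j)) for j in range(n)] for i in range(n)])
    # Covariances of X_{n+h} with X_n, X_{n-1}, ..., X_1.
    gamma_h = np.array([acvf(h + i) for i in range(n)])
    a = np.linalg.solve(Gamma, gamma_h)
    mse = acvf(0) - a @ gamma_h
    return a, mse

# Hypothetical AR(1) example: X_t = phi X_{t-1} + Z_t, {Z_t} ~ WN(0, sigma2).
phi, sigma2 = 0.6, 1.0
ar1_acvf = lambda k: sigma2 * phi ** abs(k) / (1 - phi ** 2)
a, mse = best_linear_predictor(ar1_acvf, n=5, h=2)
```

In this AR(1) case the only nonzero coefficient is the one on $X_n$, equal to $\phi^h$, so the predictor reduces to $\phi^h X_n$ with mean squared error $\sigma^2(1 + \phi^2)$ for h = 2.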


E.5.2 Forecast Results

Assuming  that  the  current  data  set  has  been  adequately  fitted  by  the  current 
ARMA(p,  q)  model,  choose  Forecasting>ARMA,  and  you  will  see  the  ARMA 
Forecast  dialog  box. 

You will be asked for the number of forecasts required, which of the transformations
you wish to invert (the default settings are to invert all of them so as to
obtain forecasts of the original data), whether or not you wish to plot prediction
bounds (assuming normality), and if so, the confidence level required, e.g., 95%. After
providing this information, click OK, and the data will be plotted with the forecasts
(and possibly prediction bounds) appended. As is to be expected, the separation of the
prediction bounds increases with the lead time h of the forecast.

Right-click  on  the  graph,  select  Info,  and  the  numerical  values  of  the  predictors 
and  prediction  bounds  will  be  printed. 
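
The widening of the bounds with lead time can be illustrated with the large-sample approximation in which the h-step mean squared error is $\sigma^2 \sum_{j=0}^{h-1} \psi_j^2$, where the $\psi_j$ are the MA(∞) coefficients of the model. The Python sketch below is an illustration under that approximation and under normality; the function name and the AR(1) example are hypothetical, and the bounds printed by ITSM need not coincide with these values.

```python
from math import sqrt
from statistics import NormalDist

def bound_half_widths(psi, sigma2, max_h, level=0.95):
    """Half-widths of approximate prediction bounds for lead times 1..max_h.
    psi(j) returns the j-th MA(infinity) coefficient of the model."""
    z = NormalDist().inv_cdf(0.5 + level / 2)   # e.g. about 1.96 for 95%
    widths, cumulative = [], 0.0
    for h in range(1, max_h + 1):
        cumulative += psi(h - 1) ** 2           # sum of psi_j^2 for j < h
        widths.append(z * sqrt(sigma2 * cumulative))
    return widths

# Hypothetical AR(1) model with phi = 0.6, for which psi_j = phi**j.
phi = 0.6
widths = bound_half_widths(lambda j: phi ** j, sigma2=1.0, max_h=6)
```

Each additional lead time adds a nonnegative term to the sum, so the half-widths never decrease as h grows.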

Example E.5.1. We left our logged, differenced, and mean-corrected airline passenger data stored in
ITSM with the subset MA(23) model found in Example E.3.5. To predict the next
24 values of the original series, select Forecasting>ARMA and accept the default
settings in the dialog box by clicking OK. You will then see the graph shown in
Figure E-10. Numerical values of the forecasts are obtained by right-clicking on the
graph and selecting Info. The ARMA Forecast dialog box also permits using a
model constructed from a subset of the data to obtain forecasts and prediction bounds
for the remaining observed values of the series.

□ 


E.6  Model  Properties 

ITSM  can  be  used  to  analyze  the  properties  of  a  specified  ARMA  process  without 
reference  to  any  data  set.  This  enables  us  to  explore  and  compare  the  properties 
of  different  ARMA  models  in  order  to  gain  insight  into  which  models  might  best 
represent  particular  features  of  a  given  data  set. 

For any ARMA(p, q) process or fractionally integrated ARMA(p, q) process
with p < 27 and q < 27, ITSM allows you to compute the autocorrelation and
partial autocorrelation functions, the spectral density and distribution functions, and
the MA(∞) and AR(∞) representations of the process. It also allows you to generate
simulated realizations of the process driven by either Gaussian or non-Gaussian noise.
The use of these options is described in this section.

Example E.6.1. We shall illustrate the use of ITSM for model analysis using the model for the
transformed series AIRPASS.TSM that is currently stored in the program.

□


E.6.1  ARMA  Models 

For modeling zero-mean stationary time series, ITSM uses the class of ARMA (and
fractionally integrated ARMA) processes. ITSM enables you to compute characteristics
of the causal ARMA model defined by

$$X_t = \phi_1 X_{t-1} + \phi_2 X_{t-2} + \cdots + \phi_p X_{t-p} + Z_t + \theta_1 Z_{t-1} + \theta_2 Z_{t-2} + \cdots + \theta_q Z_{t-q},$$

or more concisely $\phi(B)X_t = \theta(B)Z_t$, where $\{Z_t\} \sim \mathrm{WN}(0, \sigma^2)$ and the parameters are
all specified. (Characteristics of the fractionally integrated ARIMA(p, d, q) process
defined by

$$(1 - B)^d \phi(B) X_t = \theta(B) Z_t, \quad |d| < 0.5,$$

can also be computed.)

ITSM works exclusively with causal models. It will not permit you to enter a model
for which $1 - \phi_1 z - \cdots - \phi_p z^p$ has a zero inside or on the unit circle, nor does it generate
fitted models with this property. From the point of view of second-order properties, this
represents no loss of generality (Section 3.1). If you are trying to enter an ARMA(p, q)
model manually, the simplest way to ensure that your model is causal is to set all the
autoregressive coefficients close to zero (e.g., .001). ITSM will not accept a noncausal
model.
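
The causality condition can be checked numerically by locating the zeros of the autoregressive polynomial. The Python sketch below is one way to do this with NumPy; the function name is hypothetical and the check is an illustration of the condition stated above, not a description of ITSM's internal test.

```python
import numpy as np

def is_causal(phi):
    """True if 1 - phi_1 z - ... - phi_p z^p has all zeros outside the unit circle."""
    # Coefficients in increasing powers of z: 1, -phi_1, ..., -phi_p.
    ascending = np.r_[1.0, -np.asarray(phi, dtype=float)]
    # np.roots expects the highest power first.
    zeros = np.roots(ascending[::-1])
    return bool(np.all(np.abs(zeros) > 1.0))

print(is_causal([0.001, 0.001]))   # coefficients close to zero
print(is_causal([0.5, 0.6]))       # phi_1 + phi_2 > 1, so a zero lies inside the unit circle
```

The first call returns True and the second returns False, consistent with the advice that small autoregressive coefficients give a causal model.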

ITSM does not restrict models to be invertible. You can check whether or not the
current model is invertible by choosing Model>Specify and pressing the button
labeled Causal/Invertible in the resulting dialog box. If the model is noninvertible, i.e.,
if the moving-average polynomial $1 + \theta_1 z + \cdots + \theta_q z^q$ has a zero inside or on the unit
circle, the message Non-invertible will appear beneath the box containing the
moving-average coefficients. (A noninvertible model can be converted to an invertible
model with the same autocovariance function by choosing Model>Switch to
Invertible. If the model is already invertible, the program will tell you.)
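
For an MA(1) model the conversion can be written down explicitly: if $|\theta| > 1$, replacing $\theta$ by $1/\theta$ and $\sigma^2$ by $\theta^2 \sigma^2$ leaves the autocovariance function unchanged. The Python sketch below illustrates only this special case; the function names are hypothetical, and the general procedure used by ITSM for higher-order models is not described here.

```python
def ma1_acvf(theta, sigma2):
    """(gamma(0), gamma(1)) of X_t = Z_t + theta Z_{t-1}, {Z_t} ~ WN(0, sigma2)."""
    return sigma2 * (1 + theta ** 2), sigma2 * theta

def to_invertible_ma1(theta, sigma2):
    """Return (theta, sigma2) of an MA(1) with the same ACVF and |theta| <= 1.
    If |theta| == 1 the zero lies on the unit circle and the input is returned
    unchanged, since no invertible version exists in that case."""
    if abs(theta) <= 1:
        return theta, sigma2
    return 1.0 / theta, theta ** 2 * sigma2

new_theta, new_sigma2 = to_invertible_ma1(2.0, 1.0)
same_acvf = ma1_acvf(2.0, 1.0) == ma1_acvf(new_theta, new_sigma2)
```

Here the noninvertible model with $\theta = 2$, $\sigma^2 = 1$ becomes the invertible model with $\theta = 0.5$, $\sigma^2 = 4$, and both have $\gamma(0) = 5$ and $\gamma(1) = 2$.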


E.6.2  Model  ACF,  PACF 

The model ACF and PACF are plotted using Model>ACF/PACF>Model. If you wish
to change the maximum lag from the default value of 40, select Model>ACF/PACF>
Specify Lag and enter the required maximum lag. (It can be much larger than 40,
e.g., 10,000). The graph will then be modified, showing the correlations up to the
specified maximum lag.
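
For a pure moving-average model the ACF can be computed directly from the coefficients, since $\gamma(h)$ is proportional to $\sum_j \theta_j \theta_{j+h}$ with $\theta_0 = 1$. The Python sketch below illustrates this for a hypothetical seasonal model $(1 - 0.4B)(1 - 0.6B^{12})$; the function name and coefficient values are assumptions chosen for illustration and are not the fitted AIRPASS model.

```python
import numpy as np

def ma_acf(theta, max_lag):
    """ACF at lags 0..max_lag of X_t = Z_t + theta_1 Z_{t-1} + ... + theta_q Z_{t-q}."""
    psi = np.r_[1.0, np.asarray(theta, dtype=float)]
    q = len(psi) - 1
    gamma = np.array([psi[: len(psi) - h] @ psi[h:] if h <= q else 0.0
                      for h in range(max_lag + 1)])
    return gamma / gamma[0]

# Hypothetical coefficients of (1 - 0.4B)(1 - 0.6B^12) = 1 - 0.4B - 0.6B^12 + 0.24B^13.
theta = np.zeros(13)
theta[0], theta[11], theta[12] = -0.4, -0.6, 0.24
rho = ma_acf(theta, max_lag=20)
```

The resulting ACF is nonzero only at lags 1, 11, 12, and 13, with its largest absolute value at lag 12, which is the kind of seasonal peak that the model ACF plot makes visible.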

If there is a data file open as well as a model in ITSM, the model ACF and PACF
can be compared with the sample ACF and PACF by pressing the third yellow button
at the top of the ITSM window. The model correlations will then be plotted in red, with
the corresponding sample correlations shown in the same graph but plotted in green.

Figure E-11
The ACF of the model in Example E.3.5 together with the sample ACF of the transformed AIRPASS.TSM series

Figure E-12
The PACF of the model in Example E.3.5 together with the sample PACF of the transformed AIRPASS.TSM series

Example E.6.2. The sample and model ACF and PACF for the current model and transformed series
AIRPASS.TSM are shown in Figures E-11 and E-12. They are obtained by pressing
the third yellow button at the top of the ITSM window. The vertical lines represent the
model values, and the squares are the sample ACF/PACF. The graphs show that the
data and the model ACF both have large values at lag 12, while the sample and model
partial autocorrelation functions both tend to die away geometrically after the peak at
lag 12. The similarities between the graphs indicate that the model is capturing some
of the important features of the data.

□


E.6.3  Model  Representations 

As indicated in Section 3.1, if $\{X_t\}$ is a causal ARMA process, then it has an MA(∞)
representation

$$X_t = \sum_{j=0}^{\infty} \psi_j Z_{t-j}, \quad t = 0, \pm 1, \pm 2, \ldots,$$

where $\sum_{j=0}^{\infty} |\psi_j| < \infty$ and $\psi_0 = 1$.

Similarly, if $\{X_t\}$ is an invertible ARMA process, then it has an AR(∞) representation

$$Z_t = \sum_{j=0}^{\infty} \pi_j X_{t-j}, \quad t = 0, \pm 1, \pm 2, \ldots,$$

where $\sum_{j=0}^{\infty} |\pi_j| < \infty$ and $\pi_0 = 1$.

For any specified causal ARMA model you can determine the coefficients in
these representations by selecting the option Model>AR/MA Infinity. (If the
model is not invertible, you will see only the MA(∞) coefficients, since the AR(∞)
representation does not exist in this case.)

Example E.6.3. The current subset MA(23) model for the transformed series AIRPASS.TSM does not
have an AR(∞) representation, since it is not invertible. However, we can replace the
model with an invertible one having the same autocovariance function by selecting
Model>Switch to Invertible. For this model we can then find an AR(∞)
representation by selecting Model>AR/MA Infinity. This gives 50 coefficients,
the first 20 of which are shown below.


      MA-Infinity    AR-Infinity
  j       psi(j)          pi(j)
  0      1.00000        1.00000
  1     -0.36251        0.36251
  2      0.01163        0.11978
  3     -0.26346        0.30267
  4     -0.06924        0.27307
  5      0.15484       -0.00272
  6     -0.02380        0.05155
  7     -0.06557        0.16727
  8     -0.04487        0.10285
  9      0.01921        0.01856
 10     -0.00113        0.07947
 11      0.01882        0.07000
 12     -0.57008        0.58144
 13      0.00617        0.41683
 14      0.00695        0.23490
 15      0.03188        0.37200
 16      0.02778        0.38961
 17      0.01417        0.10918
 18      0.02502        0.08776
 19      0.00958        0.22791
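
Coefficients of this kind can be generated recursively by matching powers of z in $\psi(z)\phi(z) = \theta(z)$ and $\pi(z)\theta(z) = \phi(z)$. The Python sketch below is an illustration of those recursions with hypothetical function names, not ITSM's own routine; as a check it uses the first three psi(j) values from the table as the moving-average coefficients of a pure MA model.

```python
def psi_weights(phi, theta, n):
    """First n MA(infinity) coefficients of phi(B) X_t = theta(B) Z_t."""
    psi = [1.0]
    for j in range(1, n):
        value = theta[j - 1] if j <= len(theta) else 0.0
        value += sum(phi[k - 1] * psi[j - k]
                     for k in range(1, min(j, len(phi)) + 1))
        psi.append(value)
    return psi

def pi_weights(phi, theta, n):
    """First n AR(infinity) coefficients (the model must be invertible)."""
    pi = [1.0]
    for j in range(1, n):
        value = -(phi[j - 1] if j <= len(phi) else 0.0)
        value -= sum(theta[k - 1] * pi[j - k]
                     for k in range(1, min(j, len(theta)) + 1))
        pi.append(value)
    return pi

# For a pure MA model psi_j = theta_j, so the tabulated psi(1..3) serve as theta_1..theta_3.
theta = [-0.36251, 0.01163, -0.26346]
pi = pi_weights([], theta, 4)
```

Rounded to five decimals, the computed values pi[1], pi[2], and pi[3] are 0.36251, 0.11978, and 0.30267, matching the pi(j) column of the table.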


E.6.4 Generating Realizations of a Random Series

ITSM  can  be  used  to  generate  realizations  of  a  random  time  series  defined  by  the 
currently  stored  model. 

To generate such a realization, select the option Model>Simulate, and you will
see the ARMA Simulation dialog box. You will be asked to specify the number of
observations required, the white noise variance (if you wish to change it from
the current value), and an integer-valued random number seed (by specifying and
recording this integer with up to nine digits you can reproduce the same realization
at a later date by reentering the same seed). You will also have the opportunity to add
a specified mean to the simulated ARMA values. If the current model has been fitted to
transformed data, then you can also choose to apply the inverse transformations to the
simulated ARMA to generate a simulated version of the original series. The default
distribution for the white noise is Gaussian. However, by pressing the button Change
noise distribution you can select from a variety of alternative distributions,
or by checking the box Use Garch model for noise process you can
generate an ARMA process driven by GARCH noise. Finally, you can choose whether
the simulated data will overwrite the data set in the current project or whether they will
be used to create a new project. Once you are satisfied with your choices, click OK,
and the simulated series will be generated.

Example E.6.4. To generate a simulated realization of the series AIRPASS.TSM using the current
model and transformed data set, select the option Model>Simulate. The default
options in the dialog box are such as to generate a realization of the original series as
a new project, so it suffices to click OK. You will then see a graph of the simulated
series that should resemble the original series AIRPASS.TSM.

□ 
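
The role of the seed can be illustrated outside ITSM. The Python sketch below simulates a Gaussian ARMA series with NumPy, using hypothetical parameter values and a burn-in period (an assumption of this sketch) to approximate a stationary start. It uses NumPy's random number generator, so a given seed will not reproduce the realization that ITSM generates from the same seed.

```python
import numpy as np

def simulate_arma(phi, theta, n, sigma2=1.0, mean=0.0, seed=12345, burn_in=500):
    """Simulate n values of a Gaussian ARMA(p, q) series plus a constant mean."""
    rng = np.random.default_rng(seed)          # same seed -> same realization
    p, q = len(phi), len(theta)
    total = n + burn_in
    z = rng.normal(0.0, np.sqrt(sigma2), size=total + q)
    x = np.zeros(total + p)                    # p leading zeros as start-up values
    for t in range(total):
        ar_part = sum(phi[k] * x[p + t - 1 - k] for k in range(p))
        ma_part = z[q + t] + sum(theta[k] * z[q + t - 1 - k] for k in range(q))
        x[p + t] = ar_part + ma_part
    return mean + x[-n:]                       # discard the burn-in values

first = simulate_arma([0.5], [0.4], n=200, seed=123456789)
again = simulate_arma([0.5], [0.4], n=200, seed=123456789)
other = simulate_arma([0.5], [0.4], n=200, seed=987654321)
```

Here `first` and `again` are identical because they share a seed, while `other` is a different realization of the same model.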


E.6.5  Spectral  Properties 

Spectral  properties  of  both  data  and  fitted  ARMA  models  can  also  be  computed  and 
plotted  with  the  aid  of  ITSM.  The  spectral  density  of  the  model  is  determined 
by  selecting  the  option  Spectrum>Model.  Estimation  of  the  spectral  density 
from  observations  of  a  stationary  series  can  be  carried  out  in  two  ways,  either  by 
fitting  an  ARMA  model  as  already  described  and  computing  the  spectral  density 
of  the  fitted  model  (Section  4.4)  or  by  computing  the  periodogram  of  the  data 
and  smoothing  (Section  4.2).  The  latter  method  is  applied  by  selecting  the  option 
Spectrum>Smoothed  Periodogram.  Examples  of  both  approaches  are  given 
in  Chapter  4. 
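
The spectral density of an ARMA model has the closed form $f(\lambda) = \frac{\sigma^2}{2\pi} \frac{|\theta(e^{-i\lambda})|^2}{|\phi(e^{-i\lambda})|^2}$ for $-\pi \le \lambda \le \pi$. The Python sketch below evaluates this formula on a grid of frequencies; the function name and parameter values are hypothetical, and it is intended as an illustration of the formula and not as a reproduction of the Spectrum>Model output.

```python
import numpy as np

def arma_spectral_density(phi, theta, sigma2, lam):
    """Spectral density of phi(B) X_t = theta(B) Z_t at the frequencies lam."""
    lam = np.atleast_1d(np.asarray(lam, dtype=float))
    z = np.exp(-1j * lam)
    ar = np.ones_like(z)
    for k, coef in enumerate(phi, start=1):
        ar = ar - coef * z ** k                # phi(z) = 1 - phi_1 z - ... - phi_p z^p
    ma = np.ones_like(z)
    for k, coef in enumerate(theta, start=1):
        ma = ma + coef * z ** k                # theta(z) = 1 + theta_1 z + ... + theta_q z^q
    return sigma2 / (2 * np.pi) * np.abs(ma) ** 2 / np.abs(ar) ** 2

frequencies = np.linspace(0.0, np.pi, 5)
f = arma_spectral_density([0.5], [0.4], 1.0, frequencies)
```

With empty coefficient lists the function returns the constant $\sigma^2 / (2\pi)$, the spectral density of white noise.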


E.7  Multivariate  Time  Series 

Observations $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ of an m-component time series must be stored as
an ASCII file with n rows and m columns, with at least one space between
entries in the same row. To open a multivariate series for analysis, select
File>Project>Open>Multivariate and click OK. Then double-click on
the file containing the data, and you will be asked to enter the number of columns (m)
in the data file. After doing this, click OK, and you will see graphs of each component
of the series, with the multivariate tool bar at the top of the ITSM screen. For examples
of the application of ITSM to the analysis of multivariate series, see Chapter 8.
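
A file in this layout can also be inspected outside ITSM before it is opened. The Python sketch below reads whitespace-separated rows with NumPy and reports n and m; the three rows of data are made up for illustration, and an in-memory string stands in for a file on disk.

```python
import io
import numpy as np

# Hypothetical two-component series with three observations, one row per time point.
text = "1.0 2.0\n1.5 2.5\n2.0 3.5\n"

# np.loadtxt splits on whitespace by default; ndmin=2 keeps a 2-D array
# even when the file has a single row or a single column.
data = np.loadtxt(io.StringIO(text), ndmin=2)
n, m = data.shape
```

The value of m found this way is the number of columns that ITSM asks for when the file is opened.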